RESEARCH DESIGN IN CLINICAL PSYCHOLOGY


INTRODUCTION 

Clinical psychology has now passed through its embryonic and foetal stages 
and been launched into the status of a full-fledged clinical science. During the period
of its birth and formal organization during the last ten years, it was necessary for
the young profession to gather its subject matter from many sources, some scientific 
but many prescientific. The excitement attendant upon the birth of the new discip- 
line has now died down, and its future would seem to depend upon the calibre of 
the scientific work which becomes associated with it. It therefore seems important 
to direct our attention to the pressing problems of research design in clinical psychol- 
ogy because here is the keystone of all scientific progress. We need to make a critical 
survey of existing methods and to discover new ones which are particularly suited 
to the handling of clinical data. Fortunately, clinical psychology finds itself in the 
position of being able to borrow from the experience of older disciplines such as ex- 
perimental medicine and education. The first step, then, would appear to involve a 
careful analysis of the historical development of assessment methods in the older 
sciences. A second step will consist in the evaluation of the potential contributions 
of basic science psychology, and particularly experimental and statistical methods, 
in order to adopt anything which seems applicable to the clinical situation. The 
third step will involve the integration and modification of existing methods into re- 
search designs which facilitate the study of the individual case. Finally, and much 
more difficult, will be the invention of new methods particularly suited to our prob- 
lems. 

The present symposium consists of two groups of papers presented originally 
in symposia at meetings of the AMERICAN PSYCHOLOGICAL ASSOCIATION and the
MIDWESTERN PSYCHOLOGICAL ASSOCIATION. The first group, including papers by
CATTELL, ELLIS, SCHOFIELD, SELLS, WATSON and WITTENBORN, was presented on
August 31, 1951 at the annual meeting of the AMERICAN PSYCHOLOGICAL ASSOCIA-
TION in a symposium entitled “A Critical Evaluation of Research Techniques in
Clinical Psychology”. In addition, we have included two short papers by THORNE
which were originally intended as part of the Chairman’s comments at the Sym-
posium but which were not given at the meeting because of time limitations. The
second group of papers, by BERG, EDWARDS and CRONBACH, and WATSON, consists of a
revised form of a symposium on “Evaluating the Effectiveness of Psychotherapy”
originally presented before the MIDWESTERN PSYCHOLOGICAL ASSOCIATION on April
27, 1951.

In commenting on this collection of papers, it is encouraging to note the critical-
ness and originality of thought which is being devoted to the problem. Professional
psychologists have demonstrated themselves to be very adept in handling problems
of methodology in the past, and there is every reason to believe that research design 
in clinical psychology will quickly surpass anything which has been previously ac- 
complished. On this point, however, one is stimulated to speculate on how medical 
science has been able to validate its methods using methods which are relatively 
gross when compared with the refinements of modern experimental designs and such 
statistical accomplishments as factor analysis. It appears that the large scale col- 
lection of data based on clinical opinions with relatively simple statistical analysis 
may produce results which are just as valid as the more intensive statistical treat- 
ment of small sample data derived from rigidly controlled experimental situations 
such as have been typical of psychological research. 

















We would also like to comment upon the importance of well-formulated system- 
atic theory as a necessary prerequisite for the creation of meaningful hypotheses 
from which significant research will flow. Much research in clinical psychology has 
been relatively meaningless or even misleading because of the error of creating re- 
search designs in defense of untenable theoretical positions. The paper by Edwards 
and Cronbach is particularly apropos on this point. As they point out, statistical 
methods are not foolproof and actually can be twisted to prove almost any desired 
point. Of what value are conclusions from experiments purporting to study “psy-
choneurotics” or “schizophrenics” when these terms have not yet been sufficiently
delineated so as to provide homogeneous groupings?





P-TECHNIQUE FACTORIZATION AND THE DETERMINATION OF 
INDIVIDUAL DYNAMIC STRUCTURE 


R. B. CATTELL


University of Illinois 


I. THE CLINICAL APPROACH AND CANONS OF SCIENTIFIC METHOD


One of the fondest illusions of the clinician, for which he has paid heavily in loss 
of research time and effort, is the belief that there is a “clinical method” of research.
Although there is a clinical method of treatment, in the realm of applied psychology, 
the methods of research in the clinical realm are either the methods of experiment or 
the methods of statistics—or nothing! 

During the last decade, as clinical psychologists began to address themselves to 
serious objective research, they tended, however, to turn to the traditional “classical”
experimental design, being apparently unaware of the special possibilities for their 
particular field, presented by the newest statistical methods. While the intention 
to maintain strict experimental standards is very laudable, the results have not been 
of the happiest, for clinicians are mainly concerned with happenings that cannot be 
torn out of context and put into the laboratory. Many of the processes with which 
they are concerned, e.g. the process of repression, just cannot be brought about in 
its natural form in a controlled experimental situation. Fortunately, clinicians seem 
now on the brink of realizing that their best hope of making substantial progress in
the scientific investigation of clinical problems lies in the realm of development of
more refined and powerful statistical methods. It is the purpose of this paper to 
discuss and evaluate some of the most promising of those methods. 

Parenthetically, it is desirable to clarify the antithesis drawn above between 
experimental and statistical methodology. Admittedly most research has some de- 
gree of statistical analysis, but in the ultimate extremes we have on the one hand a 
crucial experiment in which everything is held constant except an independent var-
iable, while this is varied in a controlled way to see what happens to the dependent
variable. In its purest form such an experiment could yield an answer without any 
statistics whatever, by the cogency of a single case. At the other extreme we have 
the situation where there is absolutely no control of the data, but where a number 
of variables are observed in their natural variations and the relationships of depend- 
ent and independent variables are worked out by allowing, statistically, for the oper- 
ation of variables which cannot be held constant. 

Of the statistical devices which deal with happenings in situ, as in happenings
to society or to an individual in a clinical situation, the most potent are factor an- 
alysis and the analysis of variance. The former, however, should make a greater 
appeal to clinical psychologists, because it is not restricted to a single dependent 
variable, and because it asks not only whether a relationship is statistically signi- 
ficant, but also whether it is psychologically significant, i.e. it sets out to determine 
the closeness of a relationship over and above the degree of its statistical signi- 
ficance. As the reader knows, in a general if not in a specific way, factor analysis 
sets out to intercorrelate a great number of variables and to find a more limited num- 
ber of basic factors or influences the variation in which is responsible for the inter- 
correlations observed among the variables. Such factors do not necessarily cor- 
respond to any one variable and indeed in general have to be inferred from a cluster 
of variables co-varying in such a way that one can hypothesize an influence back of 
all of them. By reason of being able to deal with many variables it is the ideal de- 
vice to deal with the “total personality”, for at least it may sample among the var- 
iables the principal aspects of personality and bring them all into the single per- 
spective of the specification equation. 
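The procedure described above (intercorrelate many variables, then look for a smaller number of influences that account for the intercorrelations) can be sketched in a few lines. The following is only a minimal illustration under stated assumptions: it uses a principal-axis style eigendecomposition of the correlation matrix rather than any particular extraction method named in the text, the data are synthetic, and the function name extract_factors and the two hypothetical influences are invented for the example.

```python
import numpy as np

def extract_factors(scores, n_factors):
    """Sketch of factor extraction from the eigendecomposition of R.

    scores: (n_observations, n_variables) array of raw measurements.
    Returns (R, loadings); loadings has shape (n_variables, n_factors).
    """
    R = np.corrcoef(scores, rowvar=False)          # variable intercorrelations
    eigvals, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_factors]  # indices of the largest ones
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    # The sign of each factor is arbitrary; orient it so its loadings sum positive.
    signs = np.sign(loadings.sum(axis=0))
    signs[signs == 0] = 1.0
    return R, loadings * signs

# Synthetic illustration (assumed data): six variables, two underlying influences.
rng = np.random.default_rng(0)
n = 500
influences = rng.standard_normal((n, 2))
true_loadings = np.array([[.8, 0], [.7, 0], [.6, 0],
                          [0, .8], [0, .7], [0, .6]])
unique_sd = np.sqrt(1 - (true_loadings ** 2).sum(axis=1))
scores = influences @ true_loadings.T + rng.standard_normal((n, 6)) * unique_sd

R, loadings = extract_factors(scores, n_factors=2)

# Specification equation for one person: each standardized variable is
# approximated as a loading-weighted sum of that person's factor endowments.
person_factors = np.array([1.0, -0.5])   # hypothetical factor scores
predicted_z = loadings @ person_factors
```

The rows of loadings play the part of the weights in the specification equation; a rotation step, which practical factor analysis would normally add, is omitted here for brevity.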

II. PSYCHOANALYSIS AS FACTOR ANALYSIS


At this point the clinical reader may have recovered sufficiently from my open- 
ing paragraph to assert that there is such a thing as a clinical method of research, 
and he will add that by this he means something more substantial than endless spec-
ulation and “intuition” or the prerogatives of a crystal ball. He will point out that
in psychology as in medicine, acute powers of observation combined with a very 
good memory have succeeded, for example, in establishing certain invariable se- 
quences, i.e. in pointing to causality, and have also resulted in the recognition of 
certain syndromes which repeatedly recur and which have only much later had their 
patterns established by statistical analysis. This is quite true, but it actually 
supports the argument for the non-existence of an independent clinical method. 
When the clinician behaves in this way he is being a statistician without benefit of a 
computing machine. The I.B.M. cards and the computing machine have been
operating in his head, and he has succeeded in extracting correlations, in recog- 
nizing invariable sequences and in appreciating factor loading patterns by exactly 
the same quantitative principles as are used by the statistician, though at an un- 
conscious level of inference. Anyone who looks into the history of clinical medicine 
over the last 500 years cannot but be awestruck at the statistical feats performed by 
the sure memories and balanced judgments of devoted clinical workers who have 
sifted the experience of a lifetime in such a way as to reveal patterns of so delicate 
an association that one would expect their revelation to be possible only by the most 
refined statistical analysis. 


Clinical analysis is therefore statistical analysis, and as I hope to show in a 
moment, psychoanalysis is essentially factor analysis. If clinical research consists 
essentially in applying refined statistical methods to observation in situ with only 
occasional resort to ancillary experimental control, then the sooner the clinician
brings to his aid the memory power of I.B.M. records (or any other records), the
analytical power of the modern computing machine and the resources of factor
analytic procedure which vitalize these, the sooner shall we stride out of the present 
morass of fruitless verbal contention. When a psycho-analyst talks about an ego, 
a super ego or an Oedipus complex, I have known eminent experimental psycholo- 
gists to ask brutally to be shown one of these objects. The experimentalist is right 
in asking for scientific evidence, and insinuating that the whole of psychoanalytic 
theory could be the merest speculation—indeed he will readily find clinicians of 
different schools who will assure him explicitly that these “concepts” are the merest
invention—but he is as much at fault as the psychoanalyst if he assumes that the 
answer is going to be given him in terms of his own particular subdivision of scientific
method—namely, controlled experiment. 

When the psychoanalyst arrives at the conclusion that there exists a dynamic 
structure within personality called a super ego he means that there are certain kinds 
of behavior, quantifiable in verbal and non-verbal terms, which “go together”, i.e.
increase together or decrease together as one goes from one patient to another. This 
pattern within responses becomes visible only because people differ in the amount 
of super ego they possess, or because within any one individual the super ego var- 
iables are noticeably more active at one time than at another, thus standing out from 
non-dynamic phenomena or other dynamic patterns in another phase of variation. 
If this is what the psychoanalyst is doing, the same could be done, though admittedly 
with much greater labor, by actually measuring patients on these variables and 
demonstrating by factor analysis that a single unitary influence underlies the varying 
manifestations of what is called the super ego. Indeed, it does not seem to be 
sufficiently known among psychoanalysts that the factorization of personality var- 
iables and performances has already demonstrated the existence of the pattern of 
super ego strength, of ego strength, of some personality patterns asso-
ciated with fixation, and even the independence of a variety of ego defense
mechanisms.

III. THE INSUFFICIENCY OF FACTOR ANALYTIC R-TECHNIQUE

If I am asked critically to evaluate trends in clinical research, therefore, my 
first and most serious criticism would be that clinical psychology, and specifically 
psychoanalysis, have failed abysmally to realize the existence of those statistical 
developments which can save their whole mode of thought from the charge that it 
lies outside the bounds of scientific psychology. Moreover, they have even failed 
to recognize that in factor analytic data available in the last ten years there exist 
actual patterns, such as the G factor of super ego strength, C factor of ego strength, 
the various ergic drive patterns and defense mechanisms which, with a little tidy-
ing up by further research, would make it possible for them to measure and subject
to precise investigation the conceptual entities about which they can still disagree as 
freely as they can talk. 

Although factor analysis and its developments unquestionably provide the firm 
framework at present hidden in the looser methods used by that extremely import-
ant approach which we may call the clinical mode of thought, and although the
mastery of such methods would provide for the clinician a passport from intellectual 
nomadism to a settled, progressive architectonic growth of scientific knowledge, 
nevertheless a critique of factor analysis itself must now be given. This critique need 
not be occupied with those finer points of pure statistics which are in dispute among 
factor analysts and which tend to distract the beginner in the art out of all pro- 
portion to their importance. The issues of this nature which have practical im- 
portance are rapidly in process of being settled while the rest are likely to be settled 
by professional statisticians in their own good time. What we are concerned with 
here is rather the psychological meaning of those basic features of the logic of factor 
analysis which are widely accepted and agreed upon. (Some of the more disputed 
issues have been dealt with in a former article in this Journal.)

In this critique of factor analysis it would be a good thing to begin with the 
difficulties raised by such clinicians as have already become aware of the bearing of 
factor analysis upon their research. Two major objections have repeatedly been 
urged against factor analysis: (1) that it is a complex and tedious procedure, the re- 
quirements of which in regard to population, reliability of measurement, etc. can 
seldom be met in clinical data, and (2) that it talks about common traits and com-
mon factors and does not cope with the uniqueness of the individual case.

As to the first objection and the numerous rationalizations associated with it, 
by which the clinician seeks to avoid having to learn a new and seemingly difficult 
technique—I will say nothing here. As to the second, we must recognize that 
“unique” has two senses in psychological measurement. In the first sense it is not 
correct to say that ordinary factor analysis loses the uniqueness of the individual. 
He is represented by a unique combination of common traits, i.e. as a unique point 
in space relative to a common system of axes in hyper-space. In the second sense, 
in which we speak of the uniqueness of the factor loading, or mode of expression of a 
general trait within the individual, the ordinary technique of factor analysis admitted- 
ly fails. It does not reveal the whole variance in relation to the specific and historical 
attachments of dynamic traits in the individual. But the new factor analytic pro- 
cedure known as P-technique, i.e. the factorization of the individual case, gives pre- 
cisely such an analysis and description of unique personality traits, indeed that for 
which the clinician and the theorist in personality have always been asking. Let 
us examine this difference between the classical or R-technique in factor analysis 
and the new P-technique. 


IV. THE P-TECHNIQUE ANALYSIS OF INDIVIDUAL DYNAMICS


In ordinary factor analysis, which correlates variables upon a lot of people and 
is known as R-technique, any individual factor or source trait is estimated for a 
particular person by adding together his scores on say ten variables all of which are 
loaded in that particular factor. The same kind of total or average is computed for 
all people, as is exemplified in estimating the factor source trait of intelligence by
adding together the scores on various standard sub-tests (the same for everyone) all 
of which are known to be highly loaded (by R-technique) in the factor of general 
mental capacity. There is no particular objection to this for the factor of intelli- 
gence, but when the clinician turns to dynamic traits, e.g. to the factor of dominance, 
or of anxiety, he is interested not only in knowing how dominant the individual is, or 
how much anxiety he has, but also in knowing what particular behavior manifesta- 
tions are the principal means of expressing that dominance or anxiety for the particu-
lar individual concerned.
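The R-technique estimate described above, in which the same set of highly loaded variables is summed for every person, might be sketched as follows. The score table, the choice of salient sub-tests, and the function name unit_weight_factor_score are all hypothetical; equal (unit) weights are assumed for simplicity, where a real scoring scheme might weight by loading.

```python
import numpy as np

def unit_weight_factor_score(scores, salient):
    """Estimate a common factor by summing standardized salient variables.

    scores:  (n_persons, n_variables) raw scores.
    salient: indices of the variables assumed to be highly loaded on the factor.
    The same variables and weights are applied to every person.
    """
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    return z[:, salient].sum(axis=1)

# Hypothetical data: five persons on four sub-tests; sub-tests 0-2 are taken
# to be the salient ones, sub-test 3 is unrelated to the factor.
scores = np.array([
    [10, 12,  9, 30],
    [14, 15, 13, 10],
    [ 8,  9,  7, 20],
    [12, 11, 10, 25],
    [16, 17, 15, 15],
], dtype=float)

estimate = unit_weight_factor_score(scores, salient=[0, 1, 2])
highest = int(np.argmax(estimate))   # person with the largest factor estimate
lowest = int(np.argmin(estimate))    # person with the smallest factor estimate
```

The estimate ranks persons on the common trait but, because the weights are identical for all of them, it carries no information about which manifestations matter most for any one individual, which is the limitation the text goes on to discuss.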

The clinician is therefore right in criticizing R-technique as not giving him 
exactly what he wants—or at least not all that he wants. He wants something more 
than the scores of a person on a personality factor profile. Indeed, he wants the 
individual historical fixations of each particular factor.

However, it may be objected that in most clinical research there has been a 
failure to distinguish (in criticisms of factor analysis) between the fact that common 
factor measurements do not give all that a clinician wants and the fact that they do 
give him something, indeed, a something which he cannot afford to go without. For 
in his own use of general terms such as ego strength, amount of anxiety, etc., he is,
strictly, speaking about statistical common traits. He is for the moment forgetting 
the particular investments of the dynamic trait in the individual’s historical idiosyn- 
crasies and he is saying that one man has a stronger ego or that he is subject to more 
anxiety than another. That is to say, he is estimating the common trait endowment 
of each individual in terms of the addition and averaging of a wide variety of parti- 
cular manifestations thus reduced to a single measure of general level. 

I would argue that the clinician is right in what he does even though this means 
that he cannot be right in what he says, for these overall, general dimensions of a 
personality are of great importance in predicting how the patient will react to a 
variety of circumstances, even though we do not know the specific behavior invest- 
ments for the specific individual. For example, just as we find it important in child 
psychology, for understanding the adjustment problems of the child, to know his 
level on a general intelligence test, so also we need to know in most clinical cases the 
average level of the person’s ego strength, his super ego strength, etc., if we are to 
understand and predict the outcome of certain conflicts or the therapeutic outlook 
if he is subjected to certain clinical treatments. It has taken the psychoanalyst, 
dealing with adults, a long time to realize that a measure of the patient’s general 
intelligence is relevant to his procedure and it seemingly will take still longer for 
him to realize that measurements on other common trait dimensions of personality 
must also be taken accurately into account—as common traits and in a systematic 
specification equation—if he is to have more success with his patients. 

Clinical research and clinical practice would thus find the present measurement 
of R-technique factors useful in completing that general “sizing up” of the in-
dividual which is the first phase of diagnosis. But in the second and more refined
stage of diagnosis—namely, the determination of the particular, historical attach- 
ments of general dynamic source traits, and the analysis of symptoms--the alternative 
form of factor analysis, namely P-technique, is absolutely essential. For P-tech-
nique does with accuracy in the dynamic field what free association and similar ap-
proaches do by various combinations of intuition and guesswork. It has been shown
in recent research that by measuring a patient on a certain set of dynamic
traits, preferably by objective test methods, and recording strengths on these var-
iables from day to day over about 100 days, one can obtain a series of cor-
relations among the variables which permit a factor analysis of the total individual
personality structure. 
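The P-technique design just described differs from R-technique chiefly in the layout of the data: the rows are occasions for one person rather than different persons. A rough sketch with synthetic data follows. The two fluctuating "drives", the pattern of their expression, and the use of the eigenvalue-greater-than-one rule to decide the number of factors are assumptions made for the illustration, not details taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days, n_vars = 100, 6          # one patient, about 100 daily sessions

# Assumed generating model: two drives that wax and wane from day to day,
# each expressed in three of the six measured variables.
drives = rng.standard_normal((n_days, 2))
pattern = np.array([[.8, 0], [.8, 0], [.8, 0],
                    [0, .8], [0, .8], [0, .8]])
daily = drives @ pattern.T + 0.5 * rng.standard_normal((n_days, n_vars))

# P-technique: correlate the variables with one another over occasions.
R_p = np.corrcoef(daily, rowvar=False)            # shape (n_vars, n_vars)

# Factor the single-person correlation matrix (eigendecomposition sketch).
eigvals, eigvecs = np.linalg.eigh(R_p)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest first
n_factors = int((eigvals > 1.0).sum())               # assumed retention rule
loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
```

Each column of loadings then describes how one source of day-to-day variation is expressed in this particular patient's variables, which is the individual pattern that an R-technique study across persons cannot supply.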


The factors found by this longitudinal analysis turn out to be in the first place 
the independent drives which have been so long discussed but never demonstrated 
by clinical and biological research. The P-technique analysis of dynamic traits
also reveals the pattern of the self sentiment, showing in the case of the individual
which particular attitudes and interests are most instrumentally bound up with
the self structure. If it should be objected that the critical evaluation of free associa-
tion methods and of so-called “projective” measures ought to be included in a survey
of this kind one can only reply that it is superfluous to point to the defects of these
systems in comparison with the positive action and precise results obtainable from
P-technique in every situation in which the former might be used. Of course, mis-
perception tests, i.e. objective tests of ego defense mechanism activity, and other
approaches to dynamic trait measurements by objective methods, including selective
type projection tests, provide the best kind of raw data on which to use P-technique 
in these longitudinal studies of the individual case, but their use without the structur- 
ing power of P-technique is equivalent to putting on the play without Hamlet. 


V. MORE PRECISE SUMMARY OF STATISTICAL MEANING OF P-TECHNIQUE


Although the new method is thus on a totally new level of objectivity and an-
alytic power in comparison with free association and fantasy analysis methods at
present constituting the basis for the greater part of clinical research, there are still 
certain defects within it which remain to be systematically appreciated. A critical 
evaluation of P-technique would need to point out the following matters: 


1. The loading for a particular dynamic factor in any given behavior expression 
does not represent the intensity of action of the dynamic trait in the given 
expression but only its contribution to the variance of the latter. After P- 
technique has demonstrated the structural relationships it is a matter for further 
measurement with absolute strength measures to determine quantitatively that 
in which the clinician is perhaps most interested, namely, the relative strength 
of various motives in a particular piece of behavior. 

2. It should be appreciated that there are some experimental situations in 
which the transpose of the P-technique design, namely O-technique, would be
more suitable. In O-technique, as pointed out elsewhere, one correlates oc-
casions within the individual (i.e. the individual correlated with himself on
various occasions) instead of variables. This is most useful for picking out the 
patterns of a multiple personality. However, since, given the right conditions, 
O-technique and P-technique results are transposable, it is worth bearing in 
mind that some conditions, e.g. of accessibility of time, of data and of the patient,
will indicate that the O-technique design is the better approach to the factor 
structure in which we are eventually interested. 

3. Like all factor analytic techniques, P-technique depends on the assumption 
of linear relationships among the variables. Although curvilinear relationships 
appear to be very rare this is nevertheless a restricting condition upon the re- 
search utility of P-technique. 


4. In clinical practice, as distinct from clinical research, it may be objected 
that a successful analysis requires the patient to gain, by free association, an 
understanding of his own problems, whereas P-technique would only give the 
clinician an understanding of the dynamic connections in the patient (or give 
the patient an understanding of himself “from the outside”). However, it can
be replied that if the clinician has a surer understanding of the patient’s dynam- 
ics than present methods give him, he would be able much more skillfully to lead 
the patient around his major resistances and to a more complete insight into his 
own case. Another objection in the realm of actual practice— namely, that the 
factor analysis of every patient would constitute too heavy a chore for the clin- 
ician—can be overcome by pointing out that the clinician would need to be 
equipped with a technician who in turn should have access to those electronic 
computing machines which permit a working out of even a most complex per- 
sonality structure in a relatively short time. 

5. There is a danger that a P-technique factorization, being in a universe of
its own and unconnected with measurements on other people, would present 
special difficulties in the location and interpretation of common drives, etc. 
This could be a serious practical difficulty, especially in very idiosyncratic per- 
sonalities. For this reason a proper evaluation of P-technique, or indeed of any 
other special factorial technique, requires that we recognize the necessity of in-
cluding “marker” variables from R-technique researches. These marker var-
iables should be the most highly loaded variables in each of the common factors
known in R-technique research, e.g. for the dominance factor, the sex drive, 
generalized anxiety, surgency, ego strength, etc. The historical development of
factor analysis is in this respect fortunate, for R-technique has already estab- 
lished the principal ability and personality factors in a manner which the clin- 
ician can utilize when he comes to the more recent P-technique developments. 
Certainly it is going to make an enormous difference to the effectiveness of clin- 
ical research in the next decade according to whether R and P-techniques are 
used in irrelevant isolation or, alternatively, employed in broadly conceived 
“two-handed”, strategic combinations of R-technique and P-technique methods.


6. Although P-technique is normally used upon the day-to-day variations of 
dynamic expression provoked by the daily stimulus of life events and internal 
chemistry, it can readily be used in combination with experimental
control. In this case the special stimulus conditions deliberately controlled
by the therapist from day-to-day have to be entered as additional variables in 
the matrix. The method thus offers a possibility of testing at an early stage 
hunches about causation of symptoms. It also permits the therapist to be 
factored in with the patient, permitting a systematic exploration of the inter- 
action of personalities. 
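Two of the points listed above lend themselves to a brief sketch: the transposed (O-technique) design of point 2, and the entry of a therapist-controlled stimulus as an additional variable in the matrix, as in point 6. The data are random placeholders, and the alternating on/off stimulus schedule is a purely hypothetical example of an experimental condition.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_vars = 30, 8
records = rng.standard_normal((n_days, n_vars))   # one patient: days x variables

# P-technique correlates variables over occasions; O-technique transposes the
# design and correlates occasions over variables, from the same score matrix.
R_p = np.corrcoef(records, rowvar=False)   # (n_vars, n_vars)
R_o = np.corrcoef(records, rowvar=True)    # (n_days, n_days)

# Experimental control: a stimulus the therapist applies on alternate days
# (hypothetical schedule) is appended as one more column before correlating.
stimulus = (np.arange(n_days) % 2).astype(float)
augmented = np.column_stack([records, stimulus])
R_aug = np.corrcoef(augmented, rowvar=False)       # (n_vars + 1, n_vars + 1)

# Last row, minus its diagonal entry: how each measured variable
# covaries with the controlled condition.
stimulus_correlations = R_aug[-1, :-1]
```

Factoring R_aug instead of R_p would place the stimulus in the same factor space as the patient's variables, so that a hunch about what provokes a symptom shows up as a loading of the stimulus on the relevant factor.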


A CRITIQUE OF SYSTEMATIC THEORETICAL FOUNDATIONS IN
CLINICAL PSYCHOLOGY 


ALBERT ELLIS 
56 Park Avenue, New York City 


INTRODUCTION 


One of the most important, if least explicitly expressed, questions in clinical 
psychology today is: Shall the development of new psychodiagnostic and psycho- 
therapeutic devices and techniques remain largely on an empirical, non-theoretical 
basis, or shall such devices and techniques be constructed and revised in the light of 
some systematic theoretical outlook? 

The empirical, or what Travers has called the technician’s, approach to clin-
ical research consists of the experimenter’s selecting his proposed instrument in a
fairly arbitrary or armchair fashion, giving it to various groups of subjects, refining 
it by statistically eliminating its non-discriminating parts or items, and finally end-
ing up with a reliable and “valid” test or technique which successfully discriminates
one type of individual from another. Such clinical instruments as the Strong Voca- 
tional Interest Blank and the Minnesota Multiphasic Personality Inventory have 
largely been derived in this empirical manner, with a minimum of systematic theory 
behind their construction and standardization. 
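The empirical refinement step described above, in which items that fail to discriminate between groups are dropped, can be illustrated with a toy example. The response tables, the function name select_items, and the cut-off of a 0.25 difference in endorsement rate are all assumptions made for the sketch; an actual standardization study would use a proper significance test and far larger groups.

```python
def select_items(criterion_group, control_group, min_gap=0.25):
    """Keep items whose endorsement rates differ between two groups.

    Each group is a list of response rows; each row holds 0/1 answers
    to the same items. Returns the indices of the retained items.
    """
    n_items = len(criterion_group[0])
    kept = []
    for i in range(n_items):
        p_criterion = sum(row[i] for row in criterion_group) / len(criterion_group)
        p_control = sum(row[i] for row in control_group) / len(control_group)
        if abs(p_criterion - p_control) >= min_gap:
            kept.append(i)          # item separates the groups; retain it
    return kept

# Hypothetical answers of four criterion-group and four control-group subjects
# to a four-item trial inventory.
criterion = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [1, 1, 0, 0]]
control = [[0, 1, 0, 0], [0, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 1]]

kept = select_items(criterion, control)
```

Nothing in the selection rule refers to why an item separates the groups, which is exactly the feature of the technician's approach that the discussion below weighs.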

The systematic theoretical approach to clinical research largely makes use of 
the hypothetico-deductive experimental method. Researchers who utilize this 
method normally frame a distinct hypothesis along fairly broad theoretical lines, 
construct their instruments or technique in accordance with this hypothesis, and 
set up their experiments so that they will ultimately sustain or disprove this 
hypothesis. In the discussion that is to follow, we shall attempt to criticize both the 
empirical and the hypothetico-deductive experimental methods as they are usually 
applied to research in clinical psychology. 


THE EMPIRICAL METHOD


The main advantages of the empirical approach to clinical research seem to be 
these: 


1. The technician who works with a minimum of systematic theory can often 
produce a clinical instrument of practical value with relatively little expenditure of 
time and effort. Thus, new projective techniques of personality evaluation keep 
appearing in the literature which are based on little or vague systematic theory; 
and some of them, such as drawing and painting techniques, have so far proved to
have distinct diagnostic value. Again, electroshock treatment certainly appears to 
produce beneficial results in some cases of severe mental disturbance; yet no one, as 
yet, has presented a thoroughly satisfactory theoretical explanation of why it works 
in these cases (and why it does not work in other cases). 

2. As Travers has incisively pointed out, the empirical or technician’s approach 
to clinical research may sometimes produce a successful instrument even when the 
theory behind it is wrong. Thus, a researcher who builds a personality inventory on 
the assumption that psychosomatic questions will significantly discriminate neu-
rotics from non-neurotics may find, in the course of his standardization studies, that
such questions are quite worthless; and, by the process of eliminating them, he may 
wind up with a reasonably valid inventory—for which he may be able to give no 
theoretical explanation. 

3. The empirical approach to clinical research often proves to have heuristic 
value in that one instrument or technique, even when theoretically unsound, may
encourage the development of other instruments or techniques which may prove
to be more useful and sounder. 
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4. Since empirical approaches to clinical research are generally instigated with 
some theoretical notions, however vague or badly thought-through these may be, 
and since the observation of unsystematic, and often accidental, facts may prov ide 
the impulse to creative synthetization out of which solid theoretical frameworks 
often grow, it may well be contended that little is ever lost by “purely” empirical 
research, and that ultimately systematic theory inevitably grows out of such research. 


Countering these advantages of the technicians’ approach to clinical research 
are the following disadvantages of this type of experimentation: 

1. Empirical construction of clinical devices or techniques frequently leads to 
little or no scientific knowledge. Even if the test or procedure—such as shock treat- 
ment or hypnosis—clearly works in a certain percentage of cases, the clinician often 
has no inkling whatever of why it works, and he is forced to utilize it in a rather 
trial-and-error, haphazard manner. 

2. Clinical instruments and procedures developed through the empirical 
method may work, all right, but may do so in a distinctly limited way. Thus, per- 
sonality inventories may be successfully used in the diagnosis of some individuals— 
and also result in the selection of numerous false positive and false negative diag- 
noses. Again, various forms of physical treatments may aid the recovery of 
psychotics or severe psychoneurotics—and may also be employed in lieu of psycho- 
therapeutic treatments which may have considerably more usefulness with many 
of these patients. 

3. Clinical instruments and procedures which are developed by technicians 
without being adequately backed by systematic theoretical foundations may fre- 
quently do more harm than good, or may result in much wasted clinical effort. 
Perhaps the best case in point here is that of the Szondi test—which in the writer’s 
estimation is certainly one of the worst clinical instruments ever foisted on 
innocent clinical psychologists. The original ‘genetic’ theory behind this test is not 
sustained by any modern geneticists and is hardly worthy of scientific consideration. 
Susan Deri’s neo-Freudian attempts to give the test a new theoretical foundation 
are palpably irrelevant to Szondi’s original formulations, and are as unsound as 
orthodox psychoanalysis has generally proved to be. A score of research papers 
published during the past three years have conclusively shown that both the 
Szondi and the Deri theories of the test simply do not hold scientific water. Yet, 
several of the researchers who point out the invalidity of the Szondi and Deri theoret- 
ical formulation also go out of their way to note that, in actual clinical practice, the 
test sometimes produces “good” diagnostic results. What these writers fail to see is 
that any clinical technique—including observing the patient tie his shoelaces—will 
sometimes produce valid psychodiagnostic results in some cases—particularly when 
the patient is exceptionally emotionally disturbed. Consequently, by empirically 
“validating” clinical instruments by showing that, despite the fact that they do not 
generally work, they specifically are successful in a few selected cases, researchers 
merely manage to keep relatively worthless tests like the Szondi in clinical use for 
decades beyond the experimental disproving of both their theory and practice. 


THE HYPOTHETICO-DEDUCTIVE METHOD 


Proceeding now to a consideration of the hypothetico-deductive method of 
clinical research, we find that this method has several advantages: 

1. Normally, hypothesizing a psychological theory, and then setting up an 
experiment to prove or disprove the stated hypothesis leads to generalized scientific 
knowledge over and above that previously gained. Even if the theory is a bad one, 
it is often worthwhile finding out just how bad it is. Moreover, disproving one 
hypothesis may easily result in counter hypotheses which may be subsequently sub- 
stantiated. 





2. When the hypothetico-deductive method of clinical research is strictly 
followed, instruments and techniques may, with perhaps a minimum expenditure of 
time and effort, be factually sustained or rejected. A new projective technique, for 
example, that is based on some specific theoretical foundation may be experimentally 
studied in such a manner that, quickly and efficiently, it will either come into wide- 
spread clinical use or be dropped from the literature. Without such a theoretical 
basis, it may be impossible, ever, to validate or invalidate it, and it may continue in- 
definitely in relatively ineffectual use. 

3. The hypothetico-deductive method of formulating a systematic theory and 
then endeavoring to substantiate or refute it by factual experimentation frequently 
has distinct heuristic value, in that it encourages the setting up of alternative hyp- 
otheses and of sub-hypotheses, and often leads to further basic research. 


4. The development of satisfactory clinical instruments and procedures by 
use of the hypothetico-deductive method may, in the long run, prove to be less time- 
consuming than developing them by the more empirical method. For whereas the 
technician’s approach may produce an instrument more quickly than the theorist’s 
approach, the particular instrument developed in the latter manner may prove to be 
much more substantial and clinically valid than that developed empirically. Thus, 
the Rorschach, which has some systematic theory (albeit often vague theory) be- 
hind it has stood up better over the years as a clinical tool than have several other 
projective techniques (such as Stern’s cloud pictures) whose theoretical framework 
is more nebulous. Moreover, if clinical procedures are developed in the light of valid 
theory, they may be expected to be more practical and useful than those developed 
in the light of no theory—even when, as will often be true, more time and energy 
is expended in their development than is ordinarily expended in more empirical ap- 
proaches to clinical methods. 


5. The use of the hypothetico-deductive method in clinical psychological re- 
search tends to emphasize basic and vitally important issues and to open broad areas 
for further thinking and investigation. Thus, in a recent paper Wexler notes that 
the therapist’s bolstering the guilt feelings of the patient may at times be of more 
therapeutic advantage than his trying to assuage these guilt feelings. This observa- 
tion, as such, may be of some clinical value, but in itself is likely to lead to a change 
of technique by many therapists who hardly realize the implications of their changes. 
To help prevent such a relatively meaningless change in clinical procedure from oc- 
curring as a result of his observation, Wexler goes on to place it within the frame- 
work of a complex theory of psychosis and of psychotherapy, which he in turn re- 
lates to existing Freudian theory. The present writer, for one, would tend to disagree 
with some of Wexler’s hypothesizing in this connection; but the important point is 
that he has placed his empirical finding within a consistent theoretical framework, 
and that the subsequent upholding or confuting of his hypotheses by additional re- 
search will probably add much more to our knowledge of therapy and psychosis 
than would the mere blind use of his therapeutic technique. 


These, then, are some of the main advantages of the hypothetico-deductive 
method in clinical psychological research. The method, at the same time, has serious 
disadvantages, some of which may be listed as follows: 

1. It is normally a difficult method to follow in practice, and is frequently 
quite expensive in terms of research time and effort. 


2. As it is usually practiced, it makes use of a good number of constructs which 
may be quite difficult to anchor in observable facts, and which consequently may 
prove to be intrinsically unprovable or undisprovable by concrete experimentation. 
As the present writer has elsewhere pointed out, this is particularly true of ortho- 
dox Freudian hypotheses, which involve constructs like “id,” “superego,” and 
“ego,” which occasion much discussion, but which are virtually undefinable in con- 
crete terms, and hence unverifiable. Non-Freudian clinical theory, as well, is pre- 
sently filled to overflowing with such vague, overlapping, and unrealistic concepts 
as “personality,” “mind,” “self,” and so on. 


3. The hypothetico-deductive method frequently results in vested theoretical 
interests which give rise to devotees who fanatically spend their lives at desperate 
attempts to validate a given systematic theory. Thus, we have the Rorschachists, 
the Freudians, the Rogerians, and many other psychological sectarians who view 
virtually everything in terms of their own limited, closed systems, and who obviously 
would never surrender their faiths even if factual experiment after experiment 
proved these theoretical systems to have limited or little objective validity. 


4. Conducting clinical psychological research within a strict theoretical frame- 
work often has a falsely heuristic value, in that much thought and “fact”-finding 
is thereby stimulated, but mostly of a trivial and erroneous nature. Thus, theories 
like Jung’s idea of the racial unconscious and Rank’s notion of the birth trauma have 
led numerous over-credulous and mystical-minded writers to offer the wildest specu- 
lations and ‘facts’ in support of these views, and have thus resulted in “researches” 
of the most dubious value. Again, attempts on the part of several contemporary 
psychoanalysts and psychologists to merge psychotherapeutic theory with tele- 
pathic theory have, so far, had the “heuristic” value of inducing scores of 
clinicians to spend considerable time on what would appear to be crackpot projects 
instead of much-needed practical researches. 


5. The hypothetico-deductive method in clinical psychology frequently en- 
courages essentially scientific, and yet decidedly trivial, researches. So eager 
are the proponents of a given system to validate the hypotheses of this system 
that they often arrange setup experiments which prove some minor aspect of that 
system, and which never tackle the main clinical problems which beset us today. 
This has been particularly true of some experiments conducted by the so-called 
“client-centered” clinicians, experiments which usually prove some point dear to 
the hearts of the Rogerians but virtually meaningless to non-Rogerian therapists. 
Thus, in a recent study Bergman shows that in “client-centered” counseling self- 
exploration and insight are more apt to follow the therapist’s reflecting the client’s 
feeling, after this client has requested an evaluation from the therapist, than they 
are apt to follow the therapist’s giving a direct answer or interpretation to the client. 
Obviously, however, a client who (a) has selected a “client-centered” counselor in 
the first place; who (b) has had several sessions of “client-centered” structuring in 
his therapeutic sessions; and who (c) has a therapist who has been specifically trained 
to avoid, be squeamish about, and to be ashamed of giving direct interpretations— 
obviously such a client may be set to feel more comfortable with a mere reflection 
of his feelings rather than a direct answer to a question he has asked of the therapist. 
The important issue is, however, whether interpretation in general, and in any type 
of therapy, is more effective than mere reflection of feeling; and this issue, of course, 
Bergman (and other “client-centered” researchers) has entirely avoided. 


DISCUSSION 


In view of the foregoing advantages and disadvantages of both the empirical 
and the hypothetico-deductive methods of conducting clinical psychological research, 
what practical conclusions may be drawn? The present writer has no strong feelings 
about either of these methods, since he feels that neither normally exists in a pure 
form, and that virtually all clinical experimentation and theorizing is a product of 
both. Systematic theories, for example, are virtually never dreamed up out of pure 
cogitation, but are creatively compounded out of a great deal of prior empirical ob- 
servation. By the same token, the constructing of clinical instruments and tech- 
niques by the empirical method is invariably done on the foundation of (conscious or 
unconscious) previous theorizing. Moreover, flagrant abuses of scientific research 
procedures are common among experimenters who utilize both the empirical and the 
hypothetico-deductive methods, and do not seem to be intrinsic to the methods 
themselves, but rather to the people who use them. 

The writer’s personal prejudices, therefore, are mildly in favor of the hypothet- 
ico-deductive method, largely because it seems to be more rigorous, more con- 
sciously scientific, more wide-ranging, and in the last analysis probably more effic- 
ient than the empirical method of research. Particularly in the field of clinical 
psychology, where it is important not merely to understand the hows of human be- 
havior, but especially important to understand its whys, it would seem that the more 
complex and systematic hypothetico-deductive method stands a better chance of 
getting at the ultimate diagnostic and therapeutic goals which are in the forefront 
of our endeavors. 

With the consciousness, however, that there should be some kind of systematic 
theory behind much if not all of our clinical research, should go these distinct 
warnings: (1) That this systematic theory be operationally grounded, as far as is 
feasible, in behavioral facts, or in constructs which may be closely anchored to such 
facts. (2) That the theory be consistent and logical, but that it not be so dogmatically 
systematic that it cannot easily be changed when objective data warrant such 
change. (3) That the theory be unemotionally upheld by its adherents in a non- 
cultist, non-vested interest manner. (4) That the theory be stated as simply as 
possible in such a concrete manner that it is clearly possible to verify or to disprove 
it in the light of experimentally obtained factual evidence. 


SUMMARY 


It is pointed out in this paper that there are, in general, two basic approaches to 
clinical psychological research: (1) The empirical approach, which starts from ob- 
served data, and which develops clinical instruments and techniques directly from 
this data, with little or no attention being given to systematic theoretical formula- 
tions. (2) The hypothetico-deductive approach, which starts with a consistent, well- 
organized theory, and which designs experiments and instruments in conformity 
with this theory. Several advantages and limitations of both these approaches to 
clinical research are discussed, and it is noted that neither one is usually employed in 
a ‘pure’ manner and that both are subject to various kinds of abuses by individual 
researchers. The hypothetico-deductive method is finally favored, but with the 
proviso that those employing it take great care to see that systematic theory is 
operationally grounded in observable facts, that it be changeable and undogmatic, 
that it be unemotionally upheld by its adherents, and that it be concretely stated in 
clearly verifiable or disprovable terms. 


CRITIQUE OF SCATTER AND PROFILE ANALYSIS OF 
PSYCHOMETRIC DATA* 


INTRODUCTION 


One of the major clinical developments of the past decade has been the attempt 
to derive methods of personality diagnosis from the scatter and profile analysis of 
psychometric data. Following the initial reports suggesting that scatter and profile 
might have differential diagnostic significance, many clinical psychologists attempted 
to interpret psychometrics according to the new-found “signs”. Unfortunately, sub- 
sequent research has failed to confirm the validity of these diagnostic signs. The 
present paper represents an attempt to evaluate the rationale of scatter and profile 
analysis of psychometric data with specific reference to published research on per- 
sonality diagnosis with the Wechsler-Bellevue Intelligence Scale for Adults. 


RATIONALES 


It is appropriate for a critique of research methodology in any field to start with 
some attention to rationales, implicit or explicit, which have stimulated and directed 
the studies. When one reviews the considerable literature on pattern analysis of the 
Wechsler-Bellevue Intelligence Scale for Adults, two orientations appear. They 
are not mutually exclusive, but they tend not to be conjointly expressed. These two 
rationales may be effectively expressed in the following introductory clauses: 


1. “It would be nice if...” 
2. “It seems reasonable that...” 


In view of the currently extensive utilization of incomplete sentences by clini- 
cians, the reader should have no difficulty in providing an ending for the first 
clause. A modal response would perhaps read as follows: “It would be nice if an 
instrument of demonstrated validity for the measurement of general intelligence, 
i.e., the Wechsler-Bellevue, could also be demonstrated to yield valid data about 
non-intellective variables, i.e., personality.” The overburdened clinician is not 
likely to take exception to such a proposition as a statement of a practical ideal. 
That the sheer desirability of such a possibility should serve as the essential ration- 
ale for repeated researches brands the clinician as victim of a degree of autism 
which in a patient would prognosticate a severe illness. More pattern studies of the 
Wechsler have been rationalized by reference to this ideal of an instrument of multi- 
ple potential than by statement of any more formal logic. 

In those studies for which a statement more closely approaching a rationale has 
been provided, there is no simple ending for the clause: “It seems reasonable that...” 
which completely explicates the structure of the assumptions and hypotheses on 
which the research is founded. A modal completion might read: “It seems reason- 
able that every behavior of the individual, e.g., every test response, is expressive of 
both intellective and non-intellective or personality factors in the organization of the 
individual.” There is no quarrel with this queenly statement. However, it is redolent 
with implications of more presumptive statements which more directly generate 
attempts to discover personality patterns from analysis of Wechsler test perform- 
ances. Some of the implications are so obvious and have been so frequently stated 
as to require no repetition here. An obvious proposition equally implied but rarely 
stated is the following: “Since every behavioral act is expressive of both intellective 


*This paper was presented for the writer by Mr. John Pearson at the Chicago meeting of the 
AMERICAN PSYCHOLOGICAL ASSOCIATION, August 31, 1951. 

and non-intellective factors, every act is as good a sample of intellective as of non- 
intellective factors.” This is a statement we will quickly reject. The general theory 
of psychometrics and sheer considerations of efficiency lead to selective stimulation 
and observation of our subjects with reference to the particular variables in which 
we are most interested, and not to attempts to deduce a widely variant set of data 
from a single variety of response. This is simply a recognition of imperfect relation- 
ships among relatively discrete classes of behavior and of the fact that it is difficult 
to find reasonably homogeneous families of response which bear equal let alone high 
factor loadings on cognitive, conative, and affective dimensions. This is by no means 
a denial that the individual behaves with all of himself in an integrated dynamic 
manner! 

Any intelligence test will elicit behavior which is expressive in some degree of 
personality variables. To the degree that it has been carefully constructed to be 
maximally loaded on general and group factors of intelligence in a standardization 
sample in which personality variables are randomly distributed within a normal 
range, the variance in performance attributable to personality tendencies will be 
minimal. 

These considerations do not per se render gratuitous the efforts of those workers 
who have sought to tease out clues to personality from differential patterns of 
achievement on the Wechsler. They do, however, suggest the probably limited 
range within which personality variables may show themselves in intelligence test 
performance and, consequently, the difficulty of determining their operation. In 
view of the restricted variance potentially contributed by these variables, the use of 
clearly disparate personality groups would seem necessary in order to demonstrate 
differentials. From this viewpoint, students of patterning have been somewhat un- 
fairly chided for their use of grossly different groups, e.g., normals and psychotics. 
It follows, of course, that failure to demonstrate differentiation with such widely 
different groups makes it extremely improbable that there may be elicited from 
Wechsler patterning subtle clues helpful in diagnosis of borderline cases. It is 
important to note that discovery of such clues is the most generally presented 
purpose underlying pattern and scatter studies of the Wechsler. 

There is a further item of rationale which demands careful review. This in- 
volves the assumption that in a normal, intact, efficiently functioning individual, 
performance on all of the Wechsler subtests will be equally inferior, superior, or 
average. Specifically, the assumption is that performance at one sigma above the 
appropriate norm mean on the Information test will be accompanied by performance 
one sigma above the mean on the Comprehension subtest, etc. Statistically, the 
assumption is one of approximately perfect positive correlation among the subtests 
in the normal person, at least insofar as the intellective factors in those subtests are 
involved. The companion assumption is that variability in level of performance on 
different intellectual functions is reflective of personality variables, psychiatric 
illness, or temporary emotional disturbance. The basic assumption would appear 
to reflect, at least tacitly, a theory of intelligence in which a general factor is 
considered clearly dominant and heavily present in the tasks commonly utilized 
in intelligence tests, with group and specific factors of intelligence not carrying 
sufficient weight to account for variability in success from task to task. The theory 
of intelligence while still incomplete and in process of construction seems currently 
to be giving increasing importance to the nature and extent of group and specific 
factors which, while positively correlated, demonstrate much independence of var- 
iance. In terms of our present knowledge it would not appear feasible to argue 
that group factors, in contrast to g, are exclusively expressive of non-intellective 
variables. It is reasonable to entertain the notion that some of the subtest pattern- 
ing observed in the Wechsler is reflective of the basic organization of intellective 
functions. This would not obviate the possibility of diagnostic patterns but would 
necessitate relating psychiatric illnesses to patterns of mental organization rather 
than to personality variables. 








18 WILLIAM SCHOFIELD 


Certainly when one examines the data on subtest inter-correlations provided by 
Wechsler for two sizeable subgroups from his normative sample, one finds no basis 
for anticipating equality of subtest performances. The intercorrelations reported 
range from +.155 to +.721 in the 20-34 year age group, and from +.274 to +.705 
for the 35-49 year age group. The modal correlation for both age groups is of the 
order of +.45, and the distribution of coefficients is essentially the same for the older 
and younger normals. Since the longer opportunity for operation of differential 
personality trends and the possibly homogenizing influence of increasing age per se 
would both operate to lower intertest correlations in the older age group, the lack of 
any suggestive trend in this direction weakens the argument that patterning is 
mostly reflective of experiential and affective variables. 
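
The bearing of these coefficients on the assumption of equal subtest performance can be made concrete with a short calculation. The sketch below is an illustration only and is not drawn from Wechsler's data: it assumes two subtests expressed as standard scores with a bivariate normal relationship, and the function names are hypothetical. Under that assumption, a person one sigma above the mean on one subtest is expected to fall only r sigma above the mean on the other, and the difference between the two standard scores has a standard deviation of sqrt(2(1 - r)).

```python
import math

def expected_partner_score(z_first: float, r: float) -> float:
    """Regression estimate of the second standard score given the first."""
    return r * z_first

def sd_of_difference(r: float) -> float:
    """Standard deviation of (z1 - z2) for two standard scores correlated r."""
    return math.sqrt(2.0 * (1.0 - r))

# Lowest, modal, and highest coefficients quoted above for the 20-34 year group.
for r in (0.155, 0.45, 0.721):
    print(f"r = {r:.3f}: expected partner score = "
          f"{expected_partner_score(1.0, r):.3f} sigma, "
          f"SD of difference = {sd_of_difference(r):.3f} sigma")
```

On these assumptions, even the highest quoted coefficient leaves differences of a sizeable fraction of a sigma between subtests as a routine occurrence in normal subjects.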

A final comment regarding rationale appears warranted. It is tied closely to the 
“It would be nice if...” orientation. There have been a number of studies in which 
the expression of a rationale seems limited to an appreciation of the gross structure 
of the Wechsler-Bellevue as lending itself readily, by virtue of its subtest homogen- 
eities and whole test heterogeneity, to study of patterns! 


PURPOSES OF PATTERN STUDY 

Independent of implicit and explicit rationales, one may consider the purposes 
to be served, or perhaps more exactly, the goals of pattern study as expressed by 
various researchers. There seems to be a dichotomy of purpose, with few studies 
indicating both purposes. On the one hand, and most frequently, there are studies 
which seek, as mentioned before, to detect clues to personality useful in the resolu- 
tion of difficult, borderline diagnostic problems (15, 18, 21, 22, 30, 33, 36, 37, 43, 47, 48, 50). 
On the other hand, and less frequently, there are studies which look to pattern an- 
alysis as a medium for achieving greater insight into particular illnesses such as 
schizophrenia (17). Such attempts at better understanding of psychiatric 
syndromes through study of variations in functioning efficiency seem to have been 
very tenuously pursued and have not led to reformulations of etiology, new sug- 
gestions regarding therapy, or revisions of nosology. In these studies, there has been 
some greater awareness of the limited validities and reliabilities of the Wechsler sub- 
tests but with no particular effort at extension or improvement of the subtests or in 
particular of those subtests seemingly most related to particular dysfunctions. 
Actually, the limitations on pattern study imposed by the reliabilities and validities 
of the subtests are no greater in these investigations than in those directed to diag- 
nostic differentiation, but they seem more obvious in analysis of particular psycho- 
pathologies. It has been pointed out by Jastak that study of the mental func- 
tioning of various pathological groups demands more extensive and more reliable 
measures than are provided by subtests of the Wechsler, and further, that with a 
view to the goal of differentiation, the application of factor analysis might suggest a 
more rational and efficient selection of relatively discrete functions than are sampled 
by the Wechsler subtests. Jastak has further pointed to order of difficulty of items 
in Wechsler’s standardization and to sex differences in item difficulty as possible 
sources of obfuscation in the elicitation of intratest patterns. Related to this as 
detrimental to patterning is the truncation of the Wechsler at the extremes of the 
ability range. 


DEFICIENCIES OF RESEARCH DESIGNS 

Comments on the specific methodology of researches on Wechsler patterning 
might well begin with a résumé of the “box score” to date. There were a total of 34 
studies accessible for review. For purposes of a relatively gross analysis of results, 
these 34 studies may be distributed into three categories according as they yielded 
(or were interpreted by their authors to yield) positive, negative, or ambiguous find- 
ings. Only four studies appeared to this reviewer to be ambiguous. Of the remain- 
ing 30 studies, 21 yielded negative conclusions and only nine were interpreted by 
their authors as supporting the hypothesis of diagnostic patterning. In other words, 
approximately two-thirds of pertinent studies have given negative findings. This 
proportion is not determined by attempts at differentiation of particular groups but 
holds equally for studies involving schizophrenia, psychopathic personality, and 
other less frequently studied groups. A time analysis of the frequency of positive 
and negative studies is suggestive of the probable effects of increasing awareness of 
need for certain controls and of provision of these controls. While only slightly 
better than 50 per cent of the positive studies have been reported in the literature 
since 1945, over 70 per cent of negative studies have appeared in the same interval. 
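
The box score reduces to simple proportions. The following sketch merely restates the counts given above (34 studies, 4 ambiguous, 21 negative, 9 positive); the variable names are illustrative and nothing beyond those counts is assumed.

```python
total_studies = 34
ambiguous = 4
negative = 21
positive = 9

# Studies with an unambiguous outcome, as counted in the text.
decided = total_studies - ambiguous
assert decided == negative + positive

negative_share_of_decided = negative / decided
negative_share_of_all = negative / total_studies

print(f"negative among decided studies: {negative_share_of_decided:.0%}")
print(f"negative among all studies:     {negative_share_of_all:.0%}")
```

Counted against the 30 unambiguous studies the negative share is 70 per cent, and against all 34 it is about 62 per cent, so either reckoning lies near the two-thirds figure.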

When one seeks to account for the disagreement among studies, one finds a 
peculiar disjunction in the controls observed by various workers. A “mote and 
beam” phenomenon has seemed to operate, each worker being careful to provide the 
controls neglected in previous studies but being equally careless with regard to other 
controls in his own design. Thus, one finds early attempts to elicit diagnostic pat- 
terns neglectful with regard to the equating of ability levels in the groups studied. 
Likewise, differences in the age ranges of the various groups compared were common- 
ly disregarded in earlier studies. Garfield has contributed a particularly valuable 
paper in demonstrating the patterning effects of age, educational level, and IQ 
level. His utilization of rank order of subtest means weakened his argument 
since this measure, while reflecting group trends, ignores intra-individual patterning. 
Numerous studies of subtest patterning have been content with analysis of group 
subtest means or rank orders and, consequently, have not been definitive with res- 
pect to the potential diagnostic contribution of individual patterns of relative 
achievement. 
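
Why group means cannot settle the question of individual patterning may be seen in a small constructed example. The scores below are invented purely for illustration and come from no study; the subtest labels and helper names are likewise hypothetical. Two groups have identical subtest means, and hence identical rank orders, yet one shows marked scatter within each individual and the other shows none.

```python
from statistics import mean

# Hypothetical weighted scores on two subtests (A, B) for two persons per group.
group_with_scatter = [(13, 7), (7, 13)]
group_without_scatter = [(10, 10), (10, 10)]

def subtest_means(group):
    """Mean of each subtest across the persons in a group."""
    return tuple(mean(scores) for scores in zip(*group))

def individual_ranges(group):
    """Highest minus lowest subtest score within each person."""
    return [max(person) - min(person) for person in group]

for name, group in (("scatter", group_with_scatter),
                    ("no scatter", group_without_scatter)):
    print(name, subtest_means(group), individual_ranges(group))
```

Any analysis confined to the group means would treat the two groups as indistinguishable, although the within-person ranges differ completely.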

There have been sufficient demonstrations of the need to control the age, gen- 
eral ability, and educational achievement variables as to discourage acceptance of 
findings when these factors have been permitted to introduce distorting vari- 
ance. Recent studies have been generally conscientious with regard to 
these controls and in some instances have added controls for sex, race, and rural- 
urban origin. There has continued to exist, however, a striking lack of con- 
cern for the adequacy of the criteria employed in these studies. Thus, in many in- 
stances nothing has been specified concerning the status of a patient group other 
than a diagnosis of schizophrenia, for example. Disregard for the probable heter- 
ogeneity of any such diagnostic group has undoubtedly beclouded the findings in a 
number of studies. More specific diagnostic indications, such as of paranoid schizo- 
phrenia, catatonic schizophrenia, etc., have only very slightly enhanced the degree 
to which the source of divergences between studies may be sought in varying composi- 
tion of the subjects. Rather than feebly railing against the attenuating effects intro- 
duced by a psychiatric nosology of questionable validity, it would appear necessary 
and easy to provide that better depiction of various subject groups which would be 
communicated by their symptomatic status. Heyer in his study of anxiety neurotics 
is one of the few workers who has provided for the possibly equivocal referents of 
diagnostic labels by providing data on the frequency of predominant symptoms in 
his experimental group. 

Still another variable which has been widely ignored in pattern analysis is that 
of severity or duration of illness. Perhaps there is a vague suggestion on this point 
involved in the formal diagnosis of a group, but the possible variability is too ex- 
tensive to warrant neglect. Insofar as patterning in subtest performance is looked 
upon as reflecting inefficiencies stemming from personality aberration, the extent 
and duration of such aberrancy would likely determine in part the degree of in- 
efficiency. Thus, in the study by Olch, two-thirds of her schizophrenics had been 
hospitalized less than six months “*. Some of the discrepancies between her subtest 
ranks and those reported by Rabin, for example, may arise from the fact that Rab- 
in’s subjects had been hospitalized longer although he did not provide information 
on this point °®. It is pertinent to note that Pascal and Zeaman, who attempted to
differentiate among acute, chronic, and deteriorated psychotics by Wechsler subtest
differences, were unsuccessful, although the same difference measures did provide
group separation of neurotics from psychotics and in-patient neurotics from
out-patient neurotics ©. Together, actual symptomatic status and duration of ill- 
ness are two variables which have been unfortunately ignored in patterning studies. 
Failure to control for these, or at least to specify them adequately, may account for 
much of the mixture of positive and negative findings. It should be obvious that 
differential signs picked up in the study of groups having flagrant disorder of long 
standing might well evaporate if cross-validation were attempted with milder cases. 

The fact that the system of equated weighted scores for the subtests is based 
on the raw score distributions of Wechsler’s 20-34 year old age group introduces age 
distortions in the patterning effect inasmuch as older age groups among Wechsler’s 
norms produce widely variant mean weighted subtest scores. Both Foster“ and 
Barnett ©? have been bothered by this and have proposed corrective procedures. 
While the use of Z-scores as provided by Barnett would constitute a desirable re- 
finement, the clarifying effects of such procedure would likely be hidden by the con- 
trol deficiencies already mentioned. 
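
The age distortion just described may be illustrated by a minimal sketch. It does not reproduce the procedure of Foster or of Barnett; the norm values, the subtest names, and the function `age_corrected_z` are hypothetical and are introduced only to show how the same weighted score takes on a different meaning when referred to the norms of the subject's own age group.

```python
# Hypothetical norms: (mean, sigma) of the weighted subtest score for each
# age group. These figures are invented for illustration; they are NOT
# Wechsler's published norms.
HYPOTHETICAL_NORMS = {
    "20-34": {"digit_symbol": (10.0, 3.0), "information": (10.0, 3.0)},
    "50-59": {"digit_symbol": (6.5, 2.5), "information": (9.5, 3.0)},
}


def age_corrected_z(score, subtest, age_group, norms=HYPOTHETICAL_NORMS):
    """Express a weighted subtest score as a z-score within its own age group."""
    mean, sigma = norms[age_group][subtest]
    return (score - mean) / sigma


# The same weighted score of 7 on one subtest, referred to two age groups.
z_young = age_corrected_z(7.0, "digit_symbol", "20-34")
z_old = age_corrected_z(7.0, "digit_symbol", "50-59")
print(z_young, z_old)
```

Under these invented norms the score falls one sigma below the mean of the younger group but slightly above the mean of the older group, so that an apparent "deficit" in the pattern of an older subject may reflect nothing more than the reference group.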

Still another criticism of pattern studies may be directed to their nearly uni- 
versal contentment with statistics of group differentiation, specifically critical ratios. 
The lack of demonstration of clinical utility which results when only the reliability 
of group mean differences is tested has been clearly acknowledged by some workers. 
Not uncommonly, however, this recognition has failed to appear and, furthermore,
papers have failed to report sigmas for supposedly differentiating indices. This
criticism would be somewhat less damaging if workers, content to show group separ- 
ation, had proceeded to the necessary next step of exposing their measures to cross- 
validation. Attempts to cross-validate subtest patterns once they have been elicited 
have been almost non-existent. When measures of supposed clinical value for in- 
dividual diagnosis are proposed, “percentage overlap” is the simple but rigorous 
statistic to be applied. Students of Wechsler patterning have almost never reported 
results in percentage overlap terms. 
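
The contrast between a reliable group difference and a clinically useful one may be sketched as follows. The scores are invented, and the definition of overlap used here (the per cent of the lower-scoring group reaching or exceeding the median of the higher-scoring group) is an assumption, being only one of several definitions in use; the function names are likewise hypothetical.

```python
from math import sqrt
from statistics import mean, median, stdev


def critical_ratio(group_a, group_b):
    """Difference between two means divided by the standard error of the difference."""
    se = sqrt(stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def percentage_overlap(lower_group, higher_group):
    """Per cent of the lower-scoring group reaching or exceeding the median
    of the higher-scoring group (an assumed definition, one among several)."""
    cutoff = median(higher_group)
    reaching = sum(1 for score in lower_group if score >= cutoff)
    return 100.0 * reaching / len(lower_group)


# Invented index scores for 30 controls and 30 patients.
controls = [8, 9, 10, 10, 11, 11, 12, 12, 13, 14] * 3
patients = [6, 7, 8, 9, 9, 10, 10, 11, 12, 13] * 3

cr = critical_ratio(controls, patients)
overlap = percentage_overlap(patients, controls)
print(f"critical ratio = {cr:.2f}, overlap = {overlap:.0f} per cent")
```

With these invented figures the critical ratio is about 3, which would ordinarily be accepted as a reliable difference between means, and yet 30 per cent of the patients score at or above the median control. A sign of this kind would misclassify a large share of individual cases.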

One final comment must be made concerning the relative frequency of attempts 
to derive diagnostic patterns and indices and of attempts to check the validity of 
such measures in clinical work. The great majority of researches in this area have 
involved derivation and comparison of measures. Only rarely have attempts been 
made to apply patterning to clinical diagnosis. In the three such attempts which 
have been reported @°: *: ®), two of the studies reported positive findings. However, 
in both of these studies the criterion of psychiatric diagnosis was contaminated by 
prior knowledge of test results and their interpretation. Furthermore, in one of 
these studies, no data were provided as to what signs were used or in what manner the
individual diagnoses were reached. In the best designed of the three studies, one 
which involved study of inter-clinician agreement in application of Wechsler’s diag- 
nostic signs, the results were essentially negative ©°?. 


SUMMARY 


To summarize, attempts to use Wechsler-Bellevue subtest patterning as a clue 
to personality have stemmed more from a wish than from a rationale, have been 
neglectful of obvious limitations of the subtests as psychometric measures, have only 
recently been properly attentive to the control of pertinent variables, have shown a 
fixation at the level of gross group differentiation without cross-validation, and have 
produced generally negative results. 
INTRODUCTION


In this paper I intend to raise some questions of importance for research in 
clinical psychology. The specific issues concern the validity of our various ap- 
proaches to personality testing and the diagnostic and prognostic uses of test results. 
As the published evidence shows, the vigorous attack of clinical psychologists on 
these problems is becoming constantly stronger and more effective. A healthy con- 
sequence of this trend is that we may approach this topic in the clinical section as a 
scientific problem on the same basis and in the same terms as we would in the psy- 
chometric or experimental sections. The issues are basic to all psychological re- 
search. They are clinical only in a restricted sense. Before we launch into our as- 
signed subject, I ask your indulgence to report an interesting anecdote: 


A few years ago I was impressed by an account of the method used to weigh pigs in Patagonia. 
In that country the pig was indispensable to the national economy as a source of food, leather, fats,
and many by-products, and its weight a basic unit of exchange. Hence by regal decree it was required
that pigs be weighed only at Government weighing stations by properly qualified and appointed 
weighmasters. At each weighing station a staff consisting of a supervisory master and four assistants, 
usually ranging from GS-9 to GS-13, would work as a team. Weighing was accomplished by means of 
a six-ply mahogany board, exactly 117 inches long and 29 inches wide, balanced on a tubular mahogany 
rod suspended between two uprights. Both the board and the rod were kept highly polished at all 
times and checked daily for imperfections and wear. When a pig was to be weighed, the board was 
first suspended on the rod and its position adjusted until it was in perfect balance. It was then fastened 
in position by means of specially cut and tanned leather thongs. The pig was then placed in a marked 
off space at one end of the board by the junior weighmasters while the others proceeded with the 
weighing process. From a case containing a wide variety of native rocks, ranging widely in size and 
shape, but all highly polished and carefully smoothed, they would select one at a time and place each 
in neat rows beginning at the opposite end. Rocks were added and adjusted until the weight of the 
rocks exactly balanced the weight of the pig. The pig was then removed and the weighmasters would 
determine his weight. They would do this by counting the number of rocks of various sizes and shapes 
and combining their impressions into a global judgment expressed in terms of an average pig as a
final weight. 


With this orientation, it may be appropriate to open the discussion by reference 
to an experimental study. A number of the problems to be considered can be illus- 
trated concretely by reference to Gardner Murphy’s“*? lucid discussion of How- 
ells’ &) well-known study of persistence: 


The persistence measured in this experiment is the endurance of pain while the
experimenter looks on. Eight situations are presented, such as “edged instrument pressed against
thumb,” “holding hand over hot coils,” and so forth. Each situation becomes more grueling 
until the subject says stop; hence, the time that elapses until the word is given measures the 
amount of punishment. (The raw correlation of endurance in four of these situations with 
endurance in the other four is about .80, which gives a reliability coefficient of .90 for the 
eight.) All eight tests involve the acceptance of pain for the sake of making an impression on 
the experimenter and hence seeing oneself favorably. There is no way of telling to what extent 
the scores express the intensity of the physiological pain mechanisms (for example, one person 
may feel less pain than another in response to the same stimulus) and to what extent they
express the sheer need to appear heroic, but we can be certain that persistence as represented
here is not what might be called a simple physiological trait like “tissue lag” or autonomic
threshold; it is courage in a specific authority-laden situation. 
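
The step from a half-test correlation of about .80 to a reliability of about .90 for the eight situations is not spelled out in the passage quoted. A minimal sketch follows, on the assumption that the usual Spearman-Brown formula for a test of doubled length is what is intended; the function name is hypothetical.

```python
def spearman_brown(half_correlation, lengthening_factor=2.0):
    """Estimated reliability of a test lengthened by the given factor,
    from the correlation between its comparable parts."""
    r, k = half_correlation, lengthening_factor
    return k * r / (1.0 + (k - 1.0) * r)


# Correlation of four situations with the other four, stepped up to all eight.
reliability_of_eight = spearman_brown(0.80)
print(round(reliability_of_eight, 2))
```

The computed value is approximately .89, which agrees with the rounded figure of .90 given in the quotation, since the half-test correlation is itself reported only as "about .80."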


As we consider Murphy’s comments on Howells’ experiment, several conceptual 
problems are touched. First, the subject’s behavior was observed, not in a generalized
or in a neutral behavior situation or field, but in one which was “authority-laden.”
The test behavior cannot be understood apart from the field situation in which the
test is administered. Second, the individual behavior fields may vary significantly
among the subjects. Third, the test behavior, however obvious it may appear,
cannot be interpreted simply in terms of its content or face validity. To
achieve practical psychological meaning, test behavior must be related to a satis- 
factory criterion. Although Murphy calls attention to impressive and seemingly 
reasonable hypotheses to explain the test behavior, namely the mechanisms of self- 
enhancement and self-defense, his interpretation is as unsupported by confirmatory 
evidence as those advanced by Howells. Fourth, Howells’ unit of measurement,
namely, endurance time, implies a meaningful representation of the physiological 
response. This raises questions which again refer us to the criterion. If the units of 
measurement are meaningfully related to the criterion, their usefulness need not be 
questioned. 

I should like to raise directly and by implication a number of problems which 
affect the validity of interpretations of test behavior. These will be discussed under 
four headings: 


1. The testing field situation. 
2. The individual behavior field. 
3. Units of measurement.
4. Criteria.


While this analysis is admittedly incomplete, it is hoped that it may contribute to 
the understanding of these problems. 


THE TESTING FIELD SITUATION

The testing field situation structures the task. Several of the more superficial 
aspects of this problem have long been appreciated. Recognition of variation in 
motivation toward the test in different situations has led to the design of special keys
such as the L and K scales on the MMPI, and the check scores on the Kuder interest 
schedules. My own work in the Air Force on the development of psychiatric selection 
tests for flying personnel has demonstrated clearly how vulnerable are tests like the 
Cornell Selectee Index and certain other questionnaire and multiple-choice associa- 
tion tests to the high motivation of would-be cadets to gain admission to the 
program. 

There are, however, many subtle aspects to this problem. These are more 
difficult to recognize and perhaps may explain many anomalous results if analyzed. 
The example mentioned initially, in which Murphy suggested that the uniformity of 
results in Howells’ persistence tests of endurance of pain might be accounted for by 
the authority-laden test situation, illustrates one aspect of the problem. It should be 
noted that in this case the recognition of the situational factor raises basic questions 
about what these tests mean—what they measure. These questions can only be 
answered empirically, in relation to an appropriate criterion. 

Another situational factor is illustrated by some very recent and as yet un- 
published findings in my own laboratory. Mr. Donald Ogdon compared group and 
individual Rorschach tests administered to a selected sample of 100 aviation cadets. 
The group test was administered as part of an experimental screening battery to 
classes entering pilot training at Randolph Field. The individual test was ad- 
ministered clinically, usually several months later. The two sets of protocols were 
compared on Elizur’s “) content scale of anxiety and hostility, which scores responses 
which express feelings or attitudes of fear, unpleasantness, hatred, dislike, and so 
forth. In comparing the group and individual protocols for each individual a num- 
ber of qualitative differences were noted. In many cases a response appeared quite 
uninhibited in the group test, but controlled in the clinical situation. For example, 
in one case the following response was given to card 3 in the group test: 

“two men attempting to pull lungs from trunk of a body”

In the clinical situation it appeared as:

“two people trying to pull something apart.”
Our results seem to show the following trends: first, a greater number of unrestrained
responses in the group than in the individual test, and second, individual differences 
in this tendency. Several interesting hypotheses related to these findings are being 
investigated. It is clear, however, that one important hypothesis requiring careful 
study is that the presence of an examiner in the clinical test may inhibit the report- 
ing of highly emotional content. This has many ramifications. I merely desire to 
raise the question. 

From the examples we have cited it can be seen that the testing situation may 
structure the task of the subject in devious and subtle ways. The personality of the 
examiner, the shift from individual to group administration, the significance of the 
outcome to subjects, the auspices under which the test is given are factors which
illustrate this general category. Their effects, too frequently assumed, need to be 
studied systematically. 


THE INDIVIDUAL BEHAVIOR FIELD


This is a factor of controlling importance where test results are used in diag- 
nosis or evaluation of therapy. Again, we will begin the discussion by referring to 
some of the more general aspects of this problem. First, there is the question of how 
the individual subject relates to the examiner and testing situation. We might agree 
with Murphy that the “persistence” tests were given in an “authority-laden” situation.
And yet, if we should desire to interpret an individual’s results, we must inquire
further to what extent the particular individual accepted the situation as such,
the nature of his particular attitude toward authority, and how he reacted to the 
authority figure in the test situation. This issue is frequently called “rapport,” but 
it implies a great deal more than apparent acceptance of the test situation. 

Second, the past history and training of the individual must be understood. A 
rather obvious practice based on this point is seen when Rorschach workers are 
trained to allow for color vision anomalies, medical training, and, in an Air Force 
situation, aviation and military experience, in interpreting results. This problem is 
far more profound, however, than mere allowance for broad and obvious character- 
istics of the subject. It might be said that the power of diagnostic interpretation of 
test results is a function of the psychologist’s understanding of the individual. This 
is illustrated by the study of Hertzman and Pearce “ in which the meaning of human 
figure responses that patients projected into Rorschach ink-blots became identified 
through a study of behavior in therapy, dreams, self-description and other case 
history data. So-called “blind” or formal summaries of test records are necessary
to obtain scores which are meaningful in terms of standardization. When applied 
to individual cases, however, the scores require further careful interpretation. 

The third point is closely related and perhaps overlaps the first two. This is 
that any sample of individual behavior must be related to the total personality of 
the individual. Part characteristics, traits, or other descriptive aspects of the per- 
sonality have different meaning when the individual is regarded as the universe than 
they do when regarded as variates in a population of test scores. This problem has 
been discussed by Cattell©? from the standpoint of mathematical design and by 
Rosenzweig “*) from that of personality theory. 

Recognition of these important problems described under the individual be- 
havior field may help clarify some of the issues of test validity in diagnostic applica- 
tion. We may use test materials as informal devices for clinical examination. Indeed, 
this is a close approximation of our current practice with projective technics. How- 
ever, in this sense such materials should not be regarded as tests which are rigorously 
standardized and validated objective instruments for describing and measuring be- 
havior under prescribed conditions. 

Under optimal conditions, which must unfortunately await the availability of 
testing instruments of demonstrated validity, the clinician will accept test results as 
nomothetic facts, which describe the individual out of his idiodynamic context or 
field. In applying these results to the diagnostic problem, the clinician will use his 
knowledge of the individual’s background, current problems and general situation.
He will relate this part information to the individual whole. In this sense we may 
regard diagnosis as a professional function using valid test information. The validity 
of diagnosis is in every case a research problem requiring specific individual criteria 
and in which the specific items of evidence employed are subordinate to the diag- 
nostic conclusions. Tests used in diagnosis are of value to the extent that they are 
independently validated. 


UNITS OF MEASUREMENT


The point was made earlier that units of measurement must be meaningfully 
related to the criterion. Using Howells’ experiment for illustration, again it might 
be argued that his measurements in terms of endurance time were not a meaningful 
representation of the physiological responses studied. Nor were they a meaningful 
representation of “courage in a specific authority-laden situation,” as Murphy interpreted
these scores. However, these measurements could be made meaningful if
demonstrated to have a significant correlation with an acceptable criterion and con- 
verted into a statistical scale. 

The problem of statistical treatment of test measures of dynamic aspects of 
personality is today a frontier of research in clinical psychology. Most of our psy- 
chometric technics of data analysis are based on a mathematical model assuming 
linear variables additively combined. This approach has proven substantially suc- 
cessful with intelligence test and ability data. There are, however, sound reasons 
for rejecting this simple mathematical model with reference to the study of the whole 
personality, particularly when attention is directed to dynamic aspects of behavior. 
Maller, in a comparison of personality tests with intelligence tests, brought out
the basic consideration that: 

While practically all mental abilities show a fairly definite curve of growth from early 
infancy to maturity, no such definite trend of development is observable in regard to many 
aspects of personality. Even where a trend toward maturity is observable, it is not always
continuous or consistent or parallel to physiological development. The average adult is more 
capable, more intelligent than the average child, but not necessarily more honest, more co- 
operative or more courageous . . . The qualities of personality are themselves not continuous 
or linear. Courage beyond a certain point may become recklessness; extreme caution may 
become cowardice, etc. From the point of view of desirability such qualities may be consid- 
ered as curvilinear in nature. This may be contrasted with mental abilities which are gener- 
ally continuous and linear. 


And Horst “?, in discussing a related problem, stated: 

Investigators who have worked extensively with case materials have been impressed 
by the fact that it is impossible in many cases to take a general weighted average of all the 
factors involved and use this average as a basis for prediction of marital adjustment, voca- 
tional success, school achievement, behavior on parole, etc. In individual cases certain 
factors are much more important than others in the extent of their influence, and it is the 
configuration of factors which seems significant. 


The linear relationship assumes that the weight given any factor is the same 
for all values of that factor. It also assumes that such weight is constant without 
reference to values of other factors for the same individual. However, in the case of 
most dynamic traits one or both of these assumptions is false. For example, it is 
commonly known that fear and anxiety have adaptive value. Up to a certain point 
anxiety is normal, healthy and desirable. But when it overpowers the individual and 
disrupts his performance, it is potentially dangerous. At the same time, the intensity
of the subjective as well as the physiological anxiety reaction depends upon the external
situation which the individual faces, his state of health, fatigue and motivation
at the time of the reaction, his previous experience and reactions in similar situations,
and very likely other factors as well. This example illustrates both facts mentioned:
(a) that the weight given an anxiety score must vary for different values of
anxiety score, and (b) that such weight for anxiety varies with other factors for the 
individual, such as health, fatigue, motivation, previous anxiety experience, and the 
stress of the external situation. Statistical approaches to the expression of such 
relationships have been discussed by Horst and by Meehl*). Solutions to these
problems are a pressing need and will contribute much to the progress of science in 
this field. 
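
The two departures from the linear model named above may be made concrete in a minimal sketch. It is not the method of Horst or of Meehl; both functions, the optimum value, and every coefficient are hypothetical and are chosen only to exhibit (a) a weight for anxiety which changes with the level of anxiety and (b) a weight which changes with another factor, here fatigue.

```python
def linear_prediction(anxiety, fatigue, w_anxiety=0.5, w_fatigue=-0.3):
    """Additive model: each factor carries one fixed weight at every level."""
    return w_anxiety * anxiety + w_fatigue * fatigue


def configural_prediction(anxiety, fatigue, optimum=5.0):
    """Hypothetical non-linear model of performance.

    (a) Curvilinear: performance is best at a moderate (optimum) anxiety
        level and falls off on either side of it.
    (b) Interaction: the penalty for departing from the optimum grows
        with fatigue, so the effective weight of anxiety is not constant.
    """
    penalty_rate = 1.0 + 0.2 * fatigue
    return 10.0 - penalty_rate * (anxiety - optimum) ** 2


for anxiety in (2.0, 5.0, 8.0):
    print(
        anxiety,
        linear_prediction(anxiety, fatigue=2.0),
        configural_prediction(anxiety, fatigue=2.0),
        configural_prediction(anxiety, fatigue=6.0),
    )
```

In the additive model a higher anxiety score always shifts the prediction in the same direction. In the second model the same increase helps below the optimum and harms above it, and harms more when the subject is fatigued, which is the configuration of factors to which the quoted authors refer.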


CRITERIA 


In the survey of problems of validity the criterion is of greatest importance. 
Too many of our tests are accepted on faith or face validity. This is particularly true 
of projective technics and instruments directed at personality structure and dynam- 
ics. For this reason we will confine our remarks to that area. 

Three approaches to the problem may be reviewed as models. We shall desig- 
nate them as follows: (a) The rational approach. This begins with a hypothetical 
construct, such as a trait or process and develops test items designed to disclose 
evidence of the trait. A criterion is then sought to obtain evidence of validity. A 
variation of this approach is that of seeking criteria for measures already developed, 
such as particular scores derived from the well-known projective technics. (b) The 
experimental approach. This consists typically of an experiment designed to test a
particular hypothesis. In this approach the criterion is expressed in the selection of 
experimental and control groups. An example of this approach is a recent experi- 
ment by Drs. Bitterman and Holtzman on an Air Force contract. They tested the 
hypothesis that rate of conditioning and rate of extinction of the conditioned gal- 
vanic response to shock are significantly related to anxiety. Thirty-seven university 
men were rated for susceptibility to anxiety on the basis of psychometric indices and 
performance in a laboratory stress situation and then divided into two groups, de- 
signated as high anxiety and low anxiety. The conditioning tests were conducted 
by experimenters who did not know how individuals were classified. The statistical 
comparison of the conditioning and extinction rates for the two groups confirmed the 
hypothesis with a high degree of significance. (c) The empirical approach. This ap- 
proach begins with a study of individuals behaving in the criterion situation and 
consists of a series of empirical steps to identify and measure factors intrinsic to the 
criterion. Among the most impressive representatives of this approach are the Binet 
and Strong tests. Both are distinguished by the fact that they are built upon criteria 
representing significant life situations. Their item content and specific test designs 
were empirically derived from the study of the criterion. And their trait factorial 
make-up was an a posteriori conclusion rather than an a priori postulate.

Of the three approaches we have outlined the empirical is the most ambitious, 
but also one which provides the most conclusive evidence of validity. It is an ap- 
proach which must be conceived in program rather than project terms, as indeed 
were the studies of Terman “®, Strong “*) Wechsler “”), Hathaway © and the Buh- 
lers“), who have followed this path. 

The experimental approach and the rational approach are basically heuristic. 
Both are fraught with difficulties of many kinds. The experimental study is too 
often narrow and specific to its particular design. Nevertheless, as Marx“ has re- 
cently suggested, it may in time yield a range of rigorously investigated results 
which may contribute generalizations of importance. The rational approach, though 
an excellent tool in the hands of a brilliant clinician or insightful researcher, is fre- 
quently sterile. One reason for this is that personality theory has become so clut- 
tered with words that Maslow “) was impelled to propose that particular constructs 
be designated with subscripts to differentiate the nuances of extension intended by 
one author from those of another. Too often the rational approach must be satisfied 
with unreliable ratings as criteria. 

I believe that with a mixture of insight and experimental skill these heuristic 
approaches will contribute much to progress in the development of useful clinical 
tests. We must nevertheless recognize that at best these methods are limited. Until 
tests are standardized by reference to carefully selected populations representing 
specific life situations, such as membership in patient, occupational, age or other 
significant groups, their practical usefulness will be restricted. 








SAUL B. SELLS 
RESEARCH DESIGN AND METHODOLOGY IN EVALUATING THE
RESULTS OF PSYCHOTHERAPY 
ROBERT I. WATSON 


Washington University School of Medicine 


Before dealing critically with the present status of research in this field we may 
consider briefly certain of the difficulties facing a research worker in this area. The 
most obvious source of difficulty is the presence of various systematic approaches to 
psychotherapy—psychoanalytic, nondirective and so on. Each of these points of 
view may become proliferated by defections from the ranks. In addition, there are 
the minor unorthodoxies and individual idiosyncrasies within the fold which still
further complicate the issue. One cannot speak of the “effects of psychotherapy” 
in a bald unqualified fashion, but research must be framed in terms of a particular 
approach to psychotherapy with attention to the particular individual nuances given 
by the specific practitioners concerned.

Another difficulty, which is intimately related to that just mentioned, is the 
relative lack, both qualitative and quantitative, of research in this field. One reason 
for this is that many psychotherapists do relatively little research. This is not un- 
usual in the practice of a clinical art as witness the fact that approximately 90 per- 
cent or more of physicians do no research. The practice of psychotherapy is, after 
all, not intended as a scientific experiment. Its aim is prevention and cure by any 
available method, empiric or otherwise. But this immersion in practice is only a 
part of the picture. Whether verbalized or not, many psychotherapists act as if 
psychotherapy were beyond the ken of scientific research. The psychodynamic ebb 
and flow, the cross currents, and the storms and the calms seem to occur on such a 
vast scale as to make them withdraw from charting their deep mysteries even though 
they swim in their waters. Sensitivity to the nuances of a psychodynamic relation- 
ship is not necessarily correlated either in terms of training or of temperament with 
ability to carry on research on this topic. Then, too, the characteristics of a psycho- 
therapist, the deep identification, the critical empathy, the desire to succor the per- 
son in difficulties are not necessarily characteristics which serve to further research 
aims. I might even say that a certain obtuseness in these regards is sometimes needed 
for research planning and production in this field. Fools with a research bent rush 
in where therapeutically sensitive angels fear to tread! It might be well to add that 
these last remarks are not intended as a condemnation of either experience in or a 
talent for psychotherapy on the part of research workers in the field. Certainly 
some modicum of both is necessary. Rather, a flair for psychotherapy is no guarantee
of the presence of research talent and sometimes works against it.

Another difficulty complicating research in this field stems from the fact that our 
knowledge of the etiology of the conditions treated by psychotherapy is meager and 
confused. Clear definition of causative factors is, as yet, impossible. This is a state 
of affairs so clear to the present audience that there is no reason to labor the point. 

To add to the difficulties sketched above it is not enough in my opinion, as some 
of our critics seem to feel, to use a statistical approach alone in an attempt to under- 
stand the effects of psychotherapy. The uniqueness of the individual and the desire 
on the part of the therapist to help him find his own way of solving his problems mil- 
itate against research in which cases are thrown into a statistical hopper, there to 
be mingled with others. Something must be done to preserve intraindividuality in 
research on the effects of psychotherapy. 

In the short time at my disposal I can review no specific studies. Instead, I shall 
summarize certain trends in research, trends regarding which I am sure specific in- 
stances can be supplied from the experience of members of the audience. Studies 
dealing with the effectiveness of psychotherapy have been appearing for many years 
in the psychiatric literature. More lately the social worker and the psychologist 
have concerned themselves with this problem. 
Probably the most common technique of evaluation is in terms of anecdotal
impressions without attempt at quantification. Either a complete presentation of a 
single case or a series of shorter presentations, later summarized for their common 
elements and unique features, are used as a basis for these discussions. In the same 
general category are those reports utilizing the results obtained from one or more 
psychological tests. These reports cite change after therapy in individual cases and 
usually offer as verification only selected bits of clinical impression which are con- 
gruent with test findings. Very often some of the data for the research comes from 
the test material secured and then intermingled with other information in the usual 
fashion in clinical practice. For research purposes, this is, of course, a process of 
contamination of data. Although indispensable in teaching and practice, and in- 
deed in illustrating any research study, these studies when presented in this guise 
alone (unsupplemented by other approaches to evaluation) reflect the naturalistic, 
not the scientific phase of evaluation. It is perhaps unnecessary to remind this 
audience that this phase is a necessary prelude to scientific research. Such studies, 
even if not entirely scientifically “respectable,” pay tribute to the uniqueness and 
drama of the individual and contain much that is valuable. 

A considerable number of early studies concerned with the effectiveness of 
therapy are found to be summary studies of treatment applied to one or more noso- 
logical groups in which the results are evaluated in terms of general global judg- 
ments. These reports characteristically tell less about the individual patient than 
do the anecdotal studies just considered, and lump together 100 psychoneurotics 
or 500 schizophrenics, and so on. In so doing they gain relatively little and lose 
considerably more in failure to capture intraindividual nuances. These studies 
also suffer from the difficulties created by the unsolved problems of etiological and 
nosological classification. The patients in these investigations are submitted to 
some form or forms of treatment and, after a lapse of time, judgments are made 
concerning the effects. The reports using this approach fail, in almost every in- 
stance, to more than mention the various forms of treatment—somatic, individual 
psychotherapy, occupational therapy, the therapeutic effects of institutional life 
and so on—which the patients received. Moreover, most of these reports concern 
all sorts of treatment without any attempt at differential weighting. Most important 
of all, the criteria of effectiveness that are used are stated in such general terms as 
“cured,” “improved,” “same,” and “worse,” often with no further identification of 
the bases of judgment. A psychologist, long on experience with research design and 
statistics and short on actual experience with psychotherapy, would view these 
studies askance. In effect he would say, “How naive one must be to use categories 
of recovered, improved and so on, which categories are only vaguely or not at all 
defined as to the nature of the success in treatment.” Then he would jump to the 
conclusion that this naivete must be due to the lack of research training on the part 
of the investigators. Although this may be so in some instances, it is not necessarily 
the case. The sheer complexity of the problem also creates this counting and global 
sorting approach. Nevertheless, there are many grounds for rejecting this way of 
stating the criteria of success: the subjectivity of the judgments, the lack of 
precision of the definition, the dependence on the skill of the judge, and the tendency 
to use one or at least very few categories of success when usually “success,” whatever 
it may be, is a matter of degree involving gradations. 

Although understandable at this present stage of research development, the 
approach just described must inevitably be superseded by more precise methods, 
with the important qualification, however, that judgment by rating as a criterion is 
not necessarily dismissed simply because it has been misused. All the aforementioned 
objections concerning the use of ratings can be met by procedures already 
available. Judgments can be made on less global and more precise characteristics; 
steps of the rating scales may be defined; judge reliability may be investigated; and 
validity may be determined by correlation of judgments with independent criteria. 
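
The last two of these remedies can be illustrated numerically. What follows is a minimal sketch in Python, not a procedure taken from any study discussed here. It assumes two hypothetical judges rating the same ten patients on a defined five-step scale, estimates judge reliability as the correlation between the two sets of ratings, and estimates validity as the correlation of the pooled ratings with an invented independent criterion. The function name `pearson_r` and all figures are illustrative assumptions.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, n >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings (1 = much worse ... 5 = much improved) by two judges.
judge_a = [1, 2, 2, 3, 3, 4, 4, 5, 5, 3]
judge_b = [1, 2, 3, 3, 2, 4, 5, 5, 4, 3]
# Hypothetical independent criterion, e.g. change on an objective test score.
criterion = [-4, -1, 0, 2, 1, 5, 6, 9, 7, 2]

# Judge reliability: agreement between the two judges.
reliability = pearson_r(judge_a, judge_b)

# Validity: pooled judgment against the independent criterion.
pooled = [(a + b) / 2 for a, b in zip(judge_a, judge_b)]
validity = pearson_r(pooled, criterion)

print(f"inter-judge reliability r = {reliability:.2f}")
print(f"validity against criterion r = {validity:.2f}")
```

A correlation is only one of several possible indices of agreement; it is used here because the text speaks of validity "determined by correlation."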

It is now appropriate to turn to more adequately formulated studies of evalua- 
tion of effectiveness. Here the criticisms are more in the nature of statements of 
what are considered to be errors of omission. The published studies falling in this 
group are, for the most part, adequate so far as they go, but by failing to take into 
consideration certain factors they are to some degree at least misleading and in- 
complete. 

So far as I am aware, no published research on the evaluation of effectiveness 
of psychotherapy has: 

(1) Considered, even by specification, all the major variables pertinent to this 
field of research. 

(2) Applied uniform, objective measures of effectiveness at onset, close and 
at follow-up. 

(3) Made provision for study of spontaneous remission by use of control 
groups. 

(4) Reported simultaneously both the objective and the individual psychodynamic 
findings. 

(5) Arrived at the criteria of effectiveness by other than an arbitrary, though 
plausible, selection from those criteria potentially available. 

Each of these vital but neglected facets of research on evaluation will be considered 
in turn. What kinds of data are needed? “Information is needed about the 
patient himself, his environmental background, the therapeutic situation, and the 
therapist. In other words, data are to be obtained on variables (1) intrinsic to the 
individual (patient variables), (2) related both currently and historically to his life 
(situational variables), (3) descriptive of the therapy (therapeutic variables), and 
(4) descriptive of the therapist (therapist variables). The same study may devote 
some thought to more than one of these variables, but no study has simultaneously 
considered all of them within the same group of patients®.” This is not to say that 
research studies in this field must simultaneously investigate factors involved in all 
these variables. Rather, there should be a precise specification of the nature of these 
variables in each instance with some discussion at least of their probable influence 
upon the results in order to allow for their adequate interpretation. This, present 
studies have failed to do. A more extended discussion of this problem is offered in a 
recent paper ®?. 

In order to properly evaluate effectiveness, no matter the measures used to do 
so, it appears to me that uniform objective techniques must be applied at the time 
of initial contact with the patient, at the close of therapy, and at the time of follow- 
up at a subsequent date. In published studies to date there is no objective informa- 
tion available at these three crucial points from which change in psychological 
status can be measured. Some studies have reported evaluations of the patient at 
more than one point in the course of therapy but with the criteria for evaluation 
different at the various points in treatment. Several investigators have used psy- 
chological tests at the beginning and again at the close of treatment, but without a 
later follow-up. Other studies begin with the patient at time of discharge from 
treatment and relate the observations made at this point to situational variables. 
Still other studies although involving the three temporal points were of the mass 
scale summary judgment type described earlier. 

What crudely and somewhat inexactly may be referred to as “spontaneous” 
recovery does take place, and without some attention to the implications of this 
phenomenon research in this field is weakened. Change for the better, interpreted as 
effectiveness of the psychotherapy, may to some as yet unidentified degree be due to 
general, uncontrolled, extratherapeutic effects. The most obvious solution is, of 
course, the quantification of its effect through the presence of a non-treated control 
group. There is, of course, nothing to prevent giving this group treatment after a 
delay period, in which case they become their own controls. To the charge that such 
deliberate delay in treatment is socially harmful, one need merely call attention 
to the long delays experienced in busy outpatient clinics in any case. The use of a 
delay control group merely makes a virtue of a necessity. This technique of a non-treated 
control group, the use of which is not as yet reported in the literature, is 
now in the process of being used in at least two research programs to be mentioned 
later. 
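
The arithmetic of a delay control group can be sketched briefly. The Python fragment below is an illustration only: the scores, the group sizes, and the function name `mean_change` are invented, and no such figures appear in the programs mentioned. It subtracts the average change observed in untreated patients during the waiting period from the average change observed in treated patients over a comparable interval.

```python
from statistics import mean

def mean_change(before, after):
    """Average within-patient change (after minus before) for paired scores."""
    if len(before) != len(after):
        raise ValueError("before and after must be paired")
    return mean(a - b for b, a in zip(before, after))

# Hypothetical adjustment scores (higher = better), invented for illustration.
treated_onset = [40, 42, 38, 45, 41, 39]
treated_close = [50, 49, 47, 52, 48, 46]

# Delay group: measured at intake and again at the end of the waiting period,
# before any treatment has been given.
delay_intake = [41, 40, 39, 44, 42, 38]
delay_end_of_wait = [44, 42, 41, 47, 44, 40]

change_treated = mean_change(treated_onset, treated_close)
change_untreated = mean_change(delay_intake, delay_end_of_wait)

# Change beyond what occurred without treatment over a comparable interval.
net_change = change_treated - change_untreated

print(f"treated: {change_treated:.2f}  untreated: {change_untreated:.2f}  "
      f"net: {net_change:.2f}")
```

The sketch assumes the two intervals are of similar length and that the same uniform measure is applied to both groups; it says nothing about whether the net change is statistically dependable.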

Both for the sake of acceptance by the practicing clinicians, and for the sake of 
completeness, it is necessary to simultaneously consider both objective and psycho- 
dynamic findings. The dynamic character of the material cannot be completely 
sacrificed to objective ends. It appears to me that the case study approach sketched 
and judged incomplete earlier is, nevertheless, necessary. Without it, the intrain- 
dividual aspect is lost. Attention both to uniqueness and commonality is necessary 
for a complete and therefore accurate picture. The difficulty is that so far no re- 
searcher has found it possible or decided it was useful to present detailed informa- 
tion concerning both aspects on the same group of patients. To be sure, illustrations 
on a case or two may be offered by some workers but this is not enough. The sug- 
gestion is offered that parallel and interrelated objective group and intraindividual- 
dynamic studies should be done simultaneously with the same patients. 

The last neglected pertinent area in research to be discussed is the issue of the 
criterion or, more strictly speaking, the criteria of effectiveness. At present we are 
in the unhappy state of not knowing what are the criteria of effectiveness of psycho- 
therapy. The criteria used are arrived at not by study of this as a research problem, 
but by selection. Research has not yet isolated criteria on which there has been 
any sort of general agreement concerning their value as indices of improvement. 
Paradoxically, it is quite easy to decide in advance that a certain specific characteristic, 
e.g., degree of insight or reorganization of basic personality structure, is indicative 
and then make it the criterion. But have we the right to make such assumptions? 
I, for one, do not think so. 

In connection with some research on this topic we have taken the position that 
instead of deciding upon a limited number of arbitrarily selected general criteria of 
effectiveness we would formulate as many criteria as possible and secure evidence 
through rating forms, psychological tests, and narrative case history material. 
The results in these criteria studies when interrelated by techniques of cluster 
or factor analysis should tell us something about the criteria to be used in future 
research even though they too emerge from a process of selection but one which 
provides a broader sample of potential criteria. The therapeutic goals in any one 
individual case must, of course, be specified (and these would inevitably vary) but 
they in turn could be related to the more general criteria. 
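
As a rough illustration of what interrelating criteria might involve, the Python sketch below correlates four hypothetical criteria across eight invented patients and groups together those criteria that correlate at or above an arbitrary threshold. This linkage grouping is a crude stand-in for cluster or factor analysis, not the procedure used in the research described; the criterion names, the scores, the function names, and the 0.7 threshold are all assumptions.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def linked_groups(scores, threshold=0.7):
    """Group criteria joined, directly or through others, by r >= threshold."""
    names = list(scores)
    unassigned = set(names)
    groups = []
    for name in names:
        if name not in unassigned:
            continue
        unassigned.discard(name)
        group, frontier = [name], [name]
        while frontier:
            current = frontier.pop()
            for other in names:
                if other not in unassigned:
                    continue
                if pearson_r(scores[current], scores[other]) >= threshold:
                    unassigned.discard(other)
                    group.append(other)
                    frontier.append(other)
        groups.append(sorted(group))
    return groups

# Hypothetical scores of eight patients on four candidate criteria.
scores = {
    "symptom_relief":  [2, 3, 5, 6, 7, 8, 9, 10],
    "rated_comfort":   [1, 3, 4, 6, 6, 9, 8, 10],
    "work_adjustment": [5, 9, 2, 7, 3, 8, 4, 6],
    "employer_rating": [6, 8, 3, 7, 2, 9, 5, 6],
}

print(linked_groups(scores))
```

With these invented figures the two symptom-oriented criteria fall into one group and the two work-oriented criteria into another, which is the kind of result that would suggest two relatively independent dimensions of "improvement."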

Since this report is primarily addressed to an audience of psychologists, it might 
be well to single out the work of certain members of our profession for special con- 
sideration. It might be well to emphasize that most studies of psychotherapy by 
psychologists emphatically are not studies of effectiveness. The studies of Fiedler, 
for example, are concerned with the therapist variable, as such, and the techniques 
as used by him will presumably be quite valuable when applied in evaluating effect- 
iveness. This is a task for the future. 

The studies of Hunt and his associates with the movement scale and DRQ, 
are primarily concerned with the therapeutic variable as a measure of effectiveness 
but only incidentally with the patient, background, or therapist variable. They 
might be even more useful in a broader sphere of research on effectiveness when 
used in a research setting implied by the earlier general criticisms. 

It is to the work of Rogers more than any other man that we, as psychologists, 
are indebted for stimulating our interest in research on psychotherapy. Many re- 
search studies of psychotherapy by Rogers and his colleagues and students, it may 
be remembered, are concerned with the process as such and contribute only very in- 
directly to the study of its effectiveness. They, therefore, cannot be criticized for 
failing to do something they did not start out to do. So far as they are concerned with 
effects they expose to the reader the process taking place between onset, and close of 
therapy (as do those of Hunt) and leave one to judge its effectiveness on this basis. 
The reported studies by Rogers and his associates are often astonishingly incomplete 
in the information they give concerning the therapist, the background of the patient 
or even the characteristics of the patient. They are especially prone to ignore the 
patient’s characteristics—we know very little about his background, how sick he 
was, the nature of the symptoms he exhibits, pathological processes presumed to be 
operative, the chronicity, the background factors affecting prognosis, etc. It may 
be that from their point of view their investigation and specification is irrelevant but 
this is hardly the case with therapists of other orientations. 

A qualification concerning the foregoing criticisms must be offered. One must 
note that I refer to “published” studies. The criticisms so baldly stated are in some 
measure, although perhaps differently defined, a part of the research climate of the 
times, and no claim is made that they are necessarily original with the speaker. At 
least two research programs now in progress may meet some or all of these objections. 
The present research program of Rogers and his associates (familiar to me only 
through the unpublished first and second interim reports) may eventually be in a 
position to meet at least some of these errors of omission. 

It might be mentioned in closing that a research program on evaluation of the 
effects of psychotherapy in part specifically designed to meet these difficulties and 
objections to previous research is underway at the Washington University Medical 
School. Preliminary reports are given in a series of papers“: ?+ 5: 45 6), By no 
means, have all of these difficulties been overcome even to our own satisfaction, but 
at least they are recognized for what they are, and it is hoped that the project will 
supply partial answers, incomplete and preliminary though they may be. 
CRITIQUE OF SMALL SAMPLE STATISTICAL METHODS 
IN CLINICAL PSYCHOLOGY 
J. R. WITTENBORN 
Yale University 


The major difficulties which accompany the use of small sample statistical 
methods in psychology appear to be related to the experimental designs of psycholo- 
gists and are not necessarily related to misapplications of small sample statistics 
per se. 

Perhaps a part of the difficulty is a semantic consequence of two respects in 
which the word “significance” or “significant” may be used. Level of statistical 
significance, as is well known, refers to the relative frequency with which an effect as 
large or larger than the one under consideration could occur by chance. The size of 
the effect (it may be described as a trend or difference) which is required to meet a 
given criterion for statistical significance varies inversely with the consistency of the 
data and inversely with the size of the sample under observation. If the sample is 
relatively large or the data are very homogeneous, the magnitude of a statistically 
significant difference or of a statistically significant trend may be very small from an 
absolute standpoint and from a practical standpoint be so small as to have no sig- 
nificance whatsoever. 
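
The dependence of significance level on sample size can be shown with a short calculation. The Python sketch below is illustrative only. It holds a hypothetical mean difference fixed at one tenth of a standard deviation and computes a two-sided probability under a normal approximation, assuming two groups of equal size and a known common standard deviation. These simplifications, the function name `two_sided_p`, and the figures are assumptions made for the example.

```python
from math import erfc, sqrt

def two_sided_p(mean_diff, sd, n_per_group):
    """Normal-approximation p value for a difference between two group means.

    Assumes equal group sizes and a common, known standard deviation.
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # P(|Z| > |z|) for a standard normal deviate.
    return erfc(abs(z) / sqrt(2.0))

# The same small difference (1 point where sd = 10) at increasing sample sizes.
for n in (20, 200, 2000, 20000):
    p = two_sided_p(mean_diff=1.0, sd=10.0, n_per_group=n)
    print(f"n per group = {n:>5}   p = {p:.4f}")
```

The difference itself never changes across the rows printed; only the number of cases does, and with it the probability attached to the difference.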

Nevertheless, from the manner in which such statistically “significant” differences 
are seized upon and cherished by experimentalists attempting to verify a 
theory or by clinicians attempting to validate a procedure, one would suspect that 
the technical meaning of a statistical significance has been lost and to this term 
has become attached the layman’s connotation of the word significant (i.e., a signi- 
ficant difference is one of great practical importance and a high level of significance 
implies an extraordinary degree of practical importance). 

Actually, highly significant differences may not only have a trivial relevance for 
the problem under consideration (whether it’s the validity of a hypothesis or the 
value of a procedure), but such differences may not be at all relevant to the problem 
under consideration and may as a matter of fact be an expression of some uncontrolled, 
possibly unrecognized factor in timing, scoring, or subjects, etc. 

Small sample statistics, particularly in the analysis of the differences between 
means, have seemingly encouraged investigations which from the standpoint of re- 
Such informality concerning the independent variable results in several evils: 
It has often led to the sober comparison of unspecified and undescribed extremes of 
the independent variable; as a result the investigator may be able to say how infrequently 
his result could be ascribed to chance, but the reader is uncertain as to 
precisely to what the result may be frequently ascribed. Nevertheless, if under such 
uncertain conditions a statistically significant difference is found, it may be seized 
upon as having a practical significance and accepted as if any other differences in 
the independent variable would also lead to statistically significant difference in the 
dependent variable. Perhaps it is too much to ask investigators to secure sufficient 
data to compute appropriate correlations between the dependent and the independ- 
ent variable, but unless the two samples being compared are qualitatively different 
and only two qualities of the independent variable are relevant to the problem under 
consideration, the reporting of the significance of a mean difference in the dependent 
variable resulting from one difference in the independent variable is at best a half- 
hearted exploration. 

This points to a second evil. Many writers using small sample differences have 
not only ignored the problem of the strength of relationships but in effect have dis- 
couraged recognition of this problem. Such writers have implied that trends either 
exist or don’t exist and if trends are shown to exist, they are then evaluated in terms 
of the degree of statistical significance which may be ascribed to them. The question 
of the strength of the trend is thus ignored and only rarely is there mention of the 
fact that the degree of statistical significance may be a function of sample size only 
and may have nothing to do with the intrinsic strength of the relationship. 
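
One way of reporting the strength of a trend alongside its significance is to convert the t ratio into a correlation. The Python sketch below uses the standard identity r² = t² / (t² + df) for a comparison of two group means; the particular t value and sample sizes are hypothetical, and the function name `r_from_t` is an assumption made for the example.

```python
from math import sqrt

def r_from_t(t, df):
    """Point-biserial correlation implied by a two-group t with df degrees of freedom."""
    return sqrt(t * t / (t * t + df))

# The same t ratio of 2.6 obtained in a small and in a large study.
small_study = r_from_t(2.6, df=18)     # two groups of 10 cases
large_study = r_from_t(2.6, df=1998)   # two groups of 1000 cases

print(f"small study: r = {small_study:.3f}, r squared = {small_study ** 2:.3f}")
print(f"large study: r = {large_study:.3f}, r squared = {large_study ** 2:.4f}")
```

The same t ratio corresponds to a substantial relationship in the small study and to a negligible one in the large study, which is the distinction a bare significance level conceals.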

In addition, it may be feared that the comparison of the means of two subsamp- 
les leaves both unrecognized and unanswered questions concerning the essential 
nature of the relationship between the dependent and independent variables. Is the 
relationship continuous or discontinuous? If continuous, is it uniform through the 
range or does it have a curvilinear or logarithmic quality? 

Incidentally, inadequate designs combined with small sample statistics have led 
to a difficulty common to clinical psychology. Statistically significant trends have 
been found for complex variables such as ratios or profile patterns. As a result, 
meanings have been ascribed to these complex variables which in truth are ascrib- 
able to certain of the component variables only. 

Despite the foregoing emphasis on inadequate designs, it is not uncommon for 
psychologists to misuse small sample statistics, per se. In order to make the use of 
small sample techniques appropriate, it is necessary that the small samples corres- 
pond with the mathematical model on which the hypothetical distribution of rand- 
om sample statistics is based. For example, this usually requires that the samples 
be drawn from normally distributed populations. This in turn often means that the 
samples themselves should be essentially normal in their distribution. Often our 
data are not normally distributed. In the case of some statistics this will result in 
standard errors of estimate which are too large and this, of course, will result in diminished 
possibilities of the differences meeting a statistical criterion for significance. 
When the sample distribution is skewed and the sample differences are not 
significant, it is often possible to show the differences to be statistically significant 
by using transmuted normalized scores based on squares or logarithms of the var- 
iables. Failure to do this may result in failure to reject a null hypothesis which 
should have been rejected. Inasmuch as subtle selective factors in the sampling pro- 
cedure or other unrecognized factors in gathering the data may result in a skewing 
of the sample distributions, inconsistencies between published reports of the same 
basic phenomenon may result. Accordingly, when a trend reported by another in- 
vestigator is not verified by the data, it is important for psychologists in general 
and clinical psychologists in particular to examine their data for marked evidence of 
skewing and to transmute the variables if this is indicated. 
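
An examination for skewing of the kind recommended can be sketched as follows. The Python example computes the moment coefficient of skewness for a set of invented scores with a long right-hand tail, then again after a logarithmic transformation. The data, the function name `skewness`, and the choice of logarithms rather than another transformation are assumptions made for illustration.

```python
from math import log

def skewness(values):
    """Moment coefficient of skewness; zero for a perfectly symmetric sample."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical response times in seconds, with a long right-hand tail.
raw = [1.1, 1.3, 1.4, 1.6, 1.8, 2.0, 2.3, 2.9, 4.5, 9.0]
logged = [log(v) for v in raw]

print(f"skewness of raw scores:    {skewness(raw):.2f}")
print(f"skewness of logged scores: {skewness(logged):.2f}")
```

A transformation of this sort reduces, but need not remove, the skew; whether the transformed scores are near enough to normal remains a matter for inspection of the data.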

The requirement that samples differ from each other only by chance in all 
respects except those due to the independent variable is feasible for the agronomist 
to fulfill and not prohibitive for geneticists and many other investigators, including 
rat psychologists. It is particularly difficult for clinical psychologists to fulfill, how- 
ever, and this difficulty has certain practical consequences. For example, it is often 
impossible for the clinical psychologist to draw subjects at random from the population 
to which he wishes to generalize. He instead finds himself obliged to investigate a 
sample arbitrarily determined by the convenience of others. This means that he can 
generalize the results of his experiment only to the hypothetical population from 
which his particular sample could conceivably have been drawn at random. In order 
to generalize to the population in which he is primarily interested, additional inferences 
may be necessary. This necessity should never be glossed over; instead it 
should be made tediously explicit so that if the writer cannot perform the experiment 
required to test the secondary inference, his reader will at least know the nature of 
the experiment necessary to test this inference. 

A particular respect in which experimenters may fail to satisfy the requirements 
for proper use of small sample techniques is through the employment of a design 
which somehow or another results in allowing the selection of one subject to deter- 
mine in part the selection of another. This can either increase or decrease the hetero- 
geneity of the samples and as a consequence can make the results of the analysis in- 
applicable to the population to which the investigator wishes to refer. In all in- 
vestigations the researcher owes it not only to his readers but to himself to specify 
the population to which his results may be generalized and, if necessary, to distin- 
guish between this population and the one to which he wishes to generalize. 

There is a variety of ways that the introduction of small sample techniques in 
psychology may have been accompanied by certain disadvantages; an example 
would be the tendency for psychologists to discuss and present their studies in terms 
of significance levels instead of in terms of tables descriptive of the data. Perhaps 
reports in terms of significance levels may be sufficient in fields where the requirements 
of small sample techniques may be readily met, but in our field where these 
requirements are often compromised, some presentation of the data provides an im- 
portant safeguard. When numerous replicated tests of the hypothesis in question are 
provided by the investigators, full description of the data may not be necessary; but 
when the experimenter believes he is making a crucial test by performing one or two 
comparisons, it would seem best for him to describe his data first and then show the 
results of statistical tests. 

Another respect in which the introduction of sampling statistics may have been 
disadvantageous to psychology grows out of the false assurance of achievement 
which the naive experience after refuting a null hypothesis at some arbitrarily select- 
ed level of significance. Perhaps this effect is most insidious in the cases where it 
leads to post hoc theorizing and plausible ex post facto deductions. In fields of investi- 
gation such as human behavior, where we know so little and can imagine so much, 
a plausible interpretation may be offered for any pattern of results. Unless the in- 
vestigator has first explicitly committed himself to verbalized hypotheses which are 
expressed in terms of his anticipated data, he can imagine that the combination of 
results he finds is relevant to some hypothesis or unverbalized hunch. Without real- 
izing the unwarranted pretensions of his procedure he may then offer his data as an 
evidence for the validity of this hunch instead of modestly saying to himself and to 
his readers that his results, though unanticipated, provide the basis for a plausible 
inference and that this inference should be tested by examining its consequences 
independently in another sample. In some cases this ex post facto theorizing is so in- 
cautious as to result in investigators offering as conclusions tenuous inferences based 
on forced interpretations of statistically significant trends which are so few as to 
represent only the portion of statistically significant trends which would be expected 
if chance alone were operating. 
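
The number of "significant" trends to be expected from chance alone is easily computed. The Python sketch below assumes that the tests are independent and that every null hypothesis is true; both are simplifying assumptions, and the function names are invented for the example.

```python
def expected_by_chance(n_tests, alpha=0.05):
    """Expected count of significant results when every null hypothesis is true."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of one or more significant results among independent true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n_tests in (1, 10, 40, 100):
    print(f"{n_tests:>3} tests: expect {expected_by_chance(n_tests):.1f} significant, "
          f"P(at least one) = {prob_at_least_one(n_tests):.2f}")
```

An investigator who examines forty comparisons and interprets the two that reach the 5 per cent level has found exactly the number that chance would be expected to supply.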

This same kind of error occurs in a slightly different form when the investigator 
discards the results of a series of “unsuccessful” experiments and finally publishes 
the results of the successful portion of the eventual experiment which supports his 
hypothesis. In such cases it seems unnecessary to question the sincerity of the experimenter. 
In view of the conventional acceptance of statistically significant events 
as practically significant events, his false assurance of achievement is a natural re- 
action. This hazard can be minimized if investigators always decide ahead of time 
whether a given set of data are to be used as evidence for some particular hypothesis 
or whether they are to be used as an exploration which may provide the basis for new 
inferences which in turn could provide hypotheses for subsequent investigation. The 
confounding of purposes and the opportunistic uses of data may account for the prevalent 
use of the word “findings” in our research reports. “Findings” can be a basis 
for new inferences. Findings which have an obvious relevance for a well-known 
hypothesis may sometimes be used as an incidental but not altogether trustworthy 
evidence. But in strict usage, no data (whether they may be referred to as findings 
or in some other way) can be both a source of an hypothesis and an evidence for it. 
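
One way of keeping the two purposes apart, offered here only as a sketch and not as the writer's own procedure, is to divide the cases before any analysis into a portion reserved for exploration and a portion reserved for testing whatever hypotheses the exploration suggests. The Python function below does this with a fixed random seed so that the division is decided ahead of time and can be reproduced; the function name, the seed, and the equal split are assumptions.

```python
import random

def split_sample(case_ids, seed, explore_fraction=0.5):
    """Divide cases at random into an exploratory and a confirmatory portion."""
    rng = random.Random(seed)      # fixed seed: the division is reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

# Forty hypothetical case numbers, divided before any data are examined.
explore, confirm = split_sample(range(1, 41), seed=1949)

print("source of hypotheses:", sorted(explore))
print("evidence for hypotheses:", sorted(confirm))
```

No case appears in both portions, so no datum serves both as the source of a hypothesis and as evidence for it.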

Most of the foregoing criticisms are directed to the lack of scientific logic or in- 
consistent applications of scientific logic which often characterize psychological 
investigations. Perhaps one of the most curiously illogical reactions of psychologists 
to tests for statistical significance is the occasional claim, either expressed or implied, 
that a lack of statistically significant results has proven the validity of a null hypothesis. 
Null hypotheses are not proven; as a matter of fact no hypothesis can be 
proven. All we can do is make a null statement of the hypothesis under consideration 
and see if the null hypothesis can be shown to be implausible. If the null hypothesis 
can be disproven (i.e., shown to be implausible) then we may have added confidence 
in the validity of the positive hypothesis under consideration. If the null hypothesis 
cannot be rejected, the negative statement is not necessarily true; it simply means 
that the observations we have employed have failed to challenge a negative statement 
of the hypothesis under consideration. 
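
Why a failure to reject proves so little with small samples can be seen from an approximate calculation of how often a real difference would be detected. The Python sketch below uses a normal approximation for the comparison of two group means of equal size; the effect size of half a standard deviation, the sample sizes, and the function name `approx_power` are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate chance of rejecting the null when the true difference is
    effect_size standard deviations (two-sided test, normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * sqrt(n_per_group / 2.0)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# A true difference of half a standard deviation at several sample sizes.
for n in (10, 30, 100):
    print(f"n per group = {n:>3}   chance of detecting the difference = "
          f"{approx_power(0.5, n):.2f}")
```

With ten cases in each group a real difference of this size would go undetected in the large majority of experiments, so that a non-significant result leaves the null statement unchallenged rather than established.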

Although it would be possible to cite much published material to illustrate each 
of the foregoing criticisms, they are not considered to be applicable to most published 
research. It is not suggested that these may be illustrated more readily in the clin- 
ical literature than in other psychological literature. These criticisms are generally 
applicable and no group need feel immune. For the most part, these criticisms 
should not be interpreted as criticisms of the applicability of sampling statistics to 
psychology; they are instead criticisms of either misapplication of sampling statistics 
by psychologists or of inadequate designs which have involved the use of sampling 
statistics. It seems to the writer that the clinical psychologist who does research 
must be extraordinarily scrupulous in his choice of design and in his use of analytical 
procedures. There are many reasons for this. His studies may have practical con- 
sequences for the welfare of others. Moreover, he is working in a relatively novel, 
uncharted area and does not have a wealth of scientific dogma, convention, and 
methodology to help (or to hinder) him. Finally, he is relatively a newcomer and 
vulnerable to the criticism of his more firmly entrenched colleagues. 
INTRODUCTION 


Every clinical specialty faces its own specific problems in standardizing rules of 
evidence for evaluating therapeutic claims. Clinical psychology and psychiatry are 
relatively inexperienced in this area as compared with the older medical specialties 
but the nature of the problem is the same. We are in a position to benefit from the 
experience of the older clinical specialties even though the problem of evaluating 
psychotherapeutic claims may be more complex than in other specialties because of 
the large number of variables involved. The history of medical practice is replete 
with examples of new therapeutic agents for which overoptimistic claims have been 
initially made only to be disproved by the weight of accumulating evidence. Indeed 
it may be noted from a comparison of modern textbooks with those of 25 or 50 years 
ago that there is hardly any overlapping in the methods used formerly and today. 
It is only from bitter experience that the medical sciences have learned to establish 
strict rules for the evaluation of evidence concerning therapeutic claims. This paper 
will attempt to summarize the experience of medical and psychological sciences con- 
cerning rules of evidence for the evaluation of therapeutic claims. 


GENERAL OBSERVATIONS 


Before attempting to establish some principles for the evaluation of psychotherapy, 
it is necessary to present some general observations taken from the history 
of clinical science in general and to make some deductions relative to the broad background 
against which specific claims must be evaluated. 


1. All methods of psychotherapy claim some successes. Dating back to primitive magic, 
Mesmerism, suggestion, autosuggestion, hypnotism, faith healing, Dianetics, conditioning methods, 
psychoanalysis, nondirective methods, directive methods, etc., some successes have apparently 
been achieved with all of the known techniques in all their permutations and combinations. 

Whether this residual amount of therapeutic success depends upon the basic and irreducible 
effects of suggestion, total push, personal attention by the therapist, desensitization, etc., 
remains to be demonstrated by scientific experiment. However, it cannot be ignored. The 
“scientific” therapist must demonstrate that his methods are more valid than the faith healer’s. 


2. Initial reports tend to be more optimistic than later reports. It has been a common phenomenon 
that the first-published reports of therapeutic successes with new methods have frequently 
been much more optimistic than later evidence. A number of types of sampling errors and 
artifacts may explain this finding. First, the naive observer may discard negative findings and 
only report data which appear to confirm his overoptimistic hopes. This possibility is enhanced 
by the scientific practice of not publishing negative results and reflects an editorial selection 
factor. Second, every gambler is familiar with beginner’s luck, i.e., an “atypical” sample of events 
resulting in a spuriously skewed distribution of data which would be corrected by the collection 
of more adequate samples. There are many examples in the therapeutic field in which early optimistic 
reports have seemingly been based on skewed samples of data. Third, faulty experimental 
designs in which uncontrolled concomitant variables are the actual therapeutic factors have led 
to the making of premature claims. 




3. Time alone through the operation of natural factors tends to limit many disorders. The
history of antisyphilitic therapy yields some pertinent examples. First is the observation that the
malignancy of syphilis has been markedly attenuated in the white race as immunizing factors
have been built up. Syphilis does not have the same malignancy it had 100 years ago. Second,
control studies of the natural history of the disease in untreated cases show that in a relatively
large number of cases the disease is self-limiting and overcome by the natural resources of the
organism.

Similar comments are applicable to psychological disorders such as schizophrenia. Before 
evaluating any new therapeutic agent, it is necessary to know the probabilities that any form of 
schizophrenia will have natural remissions. This knowledge is even more important with
developmental disorders such as certain types of epilepsy which tend to be “outgrown”.


4. The notorious unreliability of subjective reports and testimonial anecdotes. All methods of
psychotherapy have their enthusiastic advocates who are ready to give glowing testimonials of
recovery and cure. A common practice is for the therapist to select a patient who has done
particularly well with his form of treatment and then to publish subjective reports indicating
a dramatic outcome.


5. Subtle selection factors producing atypical samples. Many patients go to a particular thera- 
pist because they believe that he can cure them, i.e. they are ready to be cured. Thus it is prob- 
ably true that faith healing works well because it attracts the very type of suggestible patient 
who can be expected to react favorably. Thus it is that some therapists can make a method work 
(because of situational factors connected with them) whereas others cannot. 


6. Many ‘cures’ are not bonafide because no genuine pathological process actually existed. 
Many clients give exaggerated personality reactions in the relative absence of any genuine path- 
ological process. Even though such clients may report great subjective anguish, it cannot be 
proven that anything was actually wrong with them. For data to be convincing, the degree of 
malignancy must be established. 


7. Failure to properly identify concomitant variables may result in irrational formulations. An
example may be taken from the history of medical treatment of hay fever. It was early noted that
hay fever seemed to appear about the time that Marigold (Golden Rod) came into bloom. The
golden blossoms attracted attention, so patients were treated by removing them from the presence
of Golden Rod. Actually, the less conspicuous but concomitantly blooming Rag Weed was the
etiologic agent but failed to be identified until subsequent research demonstrated the
relationship. Concomitant variables in the form of suggestion, total push, etc., may be considered to
underlie every psychotherapeutic session and their effects must be partialled out.


8. Long term follow-up studies are necessary to demonstrate whether the “cure” is permanent 
or whether the client is simply in a stage of remission of symptoms. It has long been recognized that 
the effects of such concomitant variables as suggestion are only temporary and that the symptoms 
usually return after a variable period either in the same or different patterns. Suggestion effects 
typically operate during the period when the client is maintaining contact with the therapist and 
for a variable duration thereafter before a relapse occurs. 

9. The ultimate validation of any new therapeutic method has usually depended upon collection 
of large scale data from many independent centers. 

In general, validation studies conducted in single centers even though very intensive have 
not proven to be too dependable. For example, the history of the medical treatment of bacterial 
endocarditis has been replete with overoptimistic claims for various therapeutic agents which 
have been later disproven by the accumulation of independent data. Many new remedies have 
been hailed as wonder drugs only to fall into disrepute in a few years. The medical fad of using 
acidophilus milk to change bacterial flora in gastrointestinal diseases is a case in point which 
demonstrates the dangers and difficulties in the evaluation of data. The fact is that the most 
reputable schools and clinicians have sincerely sponsored new remedies on the basis of an invalid 
analysis of data. The trouble is that these erroneous methods usually live for ten or twenty years 
before they are killed by accumulating evidence of their inefficacy.

10. In all clinical specialties, one of the most difficult problems in the assessment of
therapeutic claims has been to partial out the effects of Authority and Dogma.

The history of all forms of therapy has been most significantly influenced by the personal in- 
fluence of intellectual leaders and founders of new schools. The tendency of such leaders to assume 
positions of prominence in teaching institutions, scientific societies and in political situations has 
made it extremely difficult to distinguish between fact and fancy. This has been particularly 
true in historically early periods in the development of any specialty when scientific data were 
largely lacking and published formulations consisted of a tangled mixture of fact and speculation. 
These effects have been particularly evident in the psychological sciences where such names as 
Kraepelin, Freud and lesser luminaries have dominated the field for scores of years. 

11. In the absence of convincing evidence of the nature of etiologic factors and therapeutic
mechanisms, claims are not very convincing.

Electroshock treatment may seem to “work” on empirical evidence, but until objective data 
become available concerning mode of operation it cannot be claimed that electroshock operates 
as a specific agent. 

BASIC PRINCIPLES


On the basis of a careful consideration of all factors which may be operative in 
any therapeutic methods, it is possible to outline a number of basic conditions which 
must be satisfied before the validity of therapeutic claims may be considered estab- 
lished. These principles are as follows: 

I. IT MUST BE DEMONSTRATED THAT A GENUINE PATHOLOGICAL PROCESS ACTUALLY EXISTS.

Until the definite nature of a morbid process has been qualitatively and quantitatively
demonstrated, it cannot be assumed that any therapeutic action has taken place.


II. IF POSSIBLE, THE ETIOLOGIC FACTORS UNDERLYING THE PATHOLOGICAL PROCESS SHOULD
BE IDENTIFIED, AND THE DEGREE OF MALIGNANCY ESTABLISHED.

If therapy is to be specific and rational rather than blind and “shot-gun”, it must be
directed toward the removal or amelioration of the known etiologic factors which
predispose, precipitate or maintain the morbid process.


III. IT MUST BE DEMONSTRATED THAT THE MORBID PROCESS IS NOT SELF-LIMITING OR IN A
SELF-CURATIVE PHASE OF ITS NATURAL HISTORY.

In order to prove that time alone or the client’s own recuperative powers did not
spontaneously cure the pathological process, enough must be known about the natural
history of the disorder to rule out the possibility of spontaneous remission.


IV. THE EFFECTS OF SUCH CONCOMITANT VARIABLES AS INCREASED ATTENTION, SUGGESTION,
TOTAL PUSH EFFECTS, ETC., WHICH SEEM TO BE PRESENT IN ALL FORMS OF PSYCHOTHERAPY,
MUST BE RULED OUT. AS FAR AS POSSIBLE, CONCOMITANT VARIABLES MUST BE
IDENTIFIED.

The problem of concomitant variables is more complex than is commonly recognized.
Much of the effectiveness of psychotherapy may depend upon such usually uncontrolled
variables as the personality of the therapist, the subtlety of his use of suggestion effects,
etc.


V. EXTERNAL CRITERIA OF THERAPEUTIC SUCCESS MUST BE UTILIZED.

Many of the invalid therapeutic claims of the past have been based on a type of
circular reasoning in which the therapist hypothesizes that his method should operate in a
given way, then he creates a situation in which he does what he says he does, and then
believes that the method works because he finds a patient (or group of patients) who seem
to improve under his care. Scientific validation depends upon external criteria showing
freedom from symptoms, freedom from incapacitation in all areas of life, and a genuine
deep reorganization of personality.


VI. THE BURDEN OF PROOF RESTS UPON THOSE WHO MAKE THERAPEUTIC CLAIMS.

In order to be scientifically sound, the responsibility rests upon the therapist to
demonstrate that he has recognized and experimentally controlled all of the variables
which are known to influence therapeutic results.


VII. ADEQUATE FOLLOW-UP STUDIES PREFERABLY CONDUCTED BY INDEPENDENT OBSERVERS
ARE NECESSARY TO ESTABLISH THE PERMANENCY OF THERAPEUTIC EFFECTS.

In the field of psychotherapy, it is particularly important to conduct long term follow-up
studies to demonstrate whether “cures” are only on symptomatic levels or whether a
genuine dynamic reorientation of personality has occurred. Ideally, these studies should
be conducted by neutral observers to partial out the effects of suggestion and other subtle
concomitant variables such as the desire of the client to please the therapist and not to
offend him by reporting failure of the treatment.


VIII. IT MUST BE DEMONSTRATED THAT THE PSYCHOTHERAPEUTIC EFFECTS ARE A FUNCTION OF
DEEP PERSONALITY REORGANIZATION RATHER THAN SIMPLY OPERATING ON SYMPTOMATIC
LEVELS.

In order to accomplish this proof, it is necessary to (a) identify etiologic factors, (b)
demonstrate that the therapeutic method operates specifically against the etiologic
factors, (c) study the relationships actually operating to produce the therapeutic results, (d)
demonstrate that these results cannot be explained by the operation of concomitant
variables, and (e) demonstrate that the results are reproducible in other clinics on the same
case material.


IX. ULTIMATE VALIDATION MUST DEPEND UPON THE COLLECTION OF LARGE SCALE
STATISTICAL DATA.

Evidence from single cases, no matter how intensive and logical, must be considered
to have suggestive value only. Final validation must depend upon the accumulation of a
statistically adequate sample of cases of known consistency.


X. NOTHING MUST BE ACCEPTED SIMPLY ON THE BASIS OF THE WEIGHT OF AUTHORITY OR
DOGMA.

This axiom may seem self-evident and a reflection on the professional sophistication of
scientific psychologists and psychiatrists. But the fact remains that many of our most
distinguished colleagues appear to be guilty of “climbing on the bandwagon” and
enthusiastically identifying themselves with the prevailing winds of the moment. The sudden
uncritical acceptance of Freudianism and other isms by some psychologists who formerly
identified themselves as experimentalists is a case in point.


XI. COMPARATIVE STUDIES NEED TO BE DONE TO DEMONSTRATE THE RELATIVE EFFICACY OF
VARIOUS METHODS ON THE SAME CASE MATERIALS.

Some advocates of new methods have seemed to assume that their techniques are 
most desirable simply because some therapeutic effect can be demonstrated. Actually, 
their methods may work but others may be much more effective. In medicine, many mem- 
bers of the barbiturate family have been discovered but to date few if any work better than 
the old reliable phenobarbital. We need to accumulate data concerning relative efficacy. 


DISCUSSION 

In our opinion, few if any of the published studies of the effects of psychotherapy
may be considered to have made valid use of rules of evidence in making claims. At 
the 1951 meeting of the American Psychological Association, one of the symposia 
was given over to the critique of extensive research conducted by the Chicago group 
on the nondirective treatment of selected single cases. Although a tremendous 
amount of work had obviously been expended on the attempt to establish a rationale 
for the method of treatment, the failure to control many of the concomitant var- 
iables which are known to operate in all psychotherapy largely invalidated the con- 
clusions which the authors attempted to draw. For example, much weight was placed 
on card sorts made by the client at different points in treatment in which he made 
evaluations reflecting his own Self-concepts and which were introduced to show that 
significant changes in Self-concept were occurring during psychotherapy. In a 
question presented later by this writer from the floor of the meeting, the authors 
were asked if they had any data on what clients undergoing Dianetic treatment 
would have done with the same card sorts. This question was asked seriously in 
view of the fact that it is known that all methods have sincere converts, that these 
admirers consistently report favorable changes in their condition sometimes even to 
the point that they claim to have entirely new personalities, that in some instances 
these ‘cures’ may be relatively permanent, etc. Even though this intensive study
represents some of the most advanced work yet accomplished, its value would seem 
to be largely lost because of failure to consider rules of evidence before designing re- 
search reports and evaluating them. The criticisms advanced by us were intended 
to stimulate more logical thinking concerning rules of evidence. 


SUMMARY 


This paper attempts to establish rules of evidence for the evaluation of psycho- 
therapy. Some general observations have been presented from the history of clinical 
science in general illustrative of typical difficulties in the evaluation of research evi- 
dence. Eleven basic principles underlying rules of evidence are outlined. 


INTRODUCTION


It is impossible to objectify and evaluate the results of therapy without some 
type of a quantitative index relating to the malignancy of the morbid process with 
which we are dealing. The history of medicine is replete with examples of conflicting 
clinical claims which were not resolved until all the factors in the situation were 
quantified so as to objectify the actual nature and malignancy of pathological pro- 
cesses. One of the best examples of clinical confusion resulting from failure to quanti- 
fy the malignancy of disorders occurred in the field of tumor therapy where a dispute 
raged for years concerning the relative merits of surgery vs. X ray. This dispute was 
not settled until an objective system was established for rating the malignancy of 
tumors and for systematically collecting data concerning the results of various types 
of therapy with tumors of known malignancy. In several of the medical specialties 
‘tumor registries’ have been established for the purpose of collecting statistical data 
over a period of years concerning the prognosis of various types of tumor as treated 
with various methods. An important part of the work of these projects is to secure 
follow-up data at intervals of one, five, ten and twenty years in order to obtain con- 
clusive evidence concerning outcomes. In general, a case is not considered to be 
completely cured until ten or twenty years have passed without any recurrence. In 
medical science, the use of prognostic indices has done much to objectify methods for 
evaluating the results of therapy and to remove inevitable sources of confusion from 
the field. 

In the field of psychotherapy, there has been practically no progress in the mat- 
ter of evaluating the relative efficacy of various methods because of failure to quanti- 
fy the nature and malignancy of the clinical case materials under study. Adherents 
of various schools of psychotherapy have made a large number of claims and counter- 
claims which cannot be proved or disproved until the exact nature of the clinical 
material on which the findings are based has been established. It is the purpose of 
this paper to outline a method for quantifying the malignancy of case materials by 
the use of a rating scale which has been called the prognostic index. 


RATIONALE 


This scale is the result of several years of empirical experimentation with the
prognostic value of a number of factors which have been classically regarded as re- 
flecting the malignancy of the disorder. On theoretical bases it seemed possible to 
identify a number of clinical factors relating to prognosis which could be rated fairly 
simply by personnel of average clinical competence. In its present form, the scale in- 
volves objective ratings of five principal factors listed below. 


1. Malignancy of symptoms. In psychiatric practice, it is generally agreed that some symptoms
have more prognostic value than others. For example, organic symptoms are usually regarded as
having more grave prognosis than functional symptoms. Some symptoms such as headache are
of little differentiating value while others have pathognomonic significance. Hallucinations are of
more grave prognostic significance than “nervousness”. As a rough differentiating index, the
standard syndromes have been arranged on the scale according to their malignancy ranging from
simple behavior disorders or maladjustments to severe functional or organic psychoses.

2. Trend of disorder. It is important to discover whether a morbid process is improving,
unchanging or progressive. In some types of disorder, e.g. convulsive states, quantitative data
concerning frequency and severity of attacks are easily collectable. Progressive disorders are
of poor prognostic outcome.


3. Chronicity. Although chronicity is probably correlated with trend of disorder, it appears to 
involve a different type of measurement. In general, the longer a disorder has lasted before the 
beginning of therapy, the graver the prognosis. Similarly, chronicity in spite of adequate treat- 
ment is a poor prognostic sign. 


4. Incapacitation. In our opinion it is desirable to make an objective estimate concerning the
degree of incapacitation at work and play as distinguished from the subject’s own ratings of
subjective symptoms. After all, on a relative basis, the degree of incapacitation is a fact of great
practical importance in evaluating mental status. A person may be greatly incapacitated by
a mild disorder, or may show only slight incapacitation from a serious disorder.

5. Subjective status. An attempt is made here to consider the client’s subjective status as he
perceives it himself, and also to estimate the degree of insight shown. Psychiatric practice is in
agreement that the degree of insight is a very important prognostic factor.

We have also considered adding a sixth factor, level of personality resources, 
which would attempt to estimate the resources available with which to work in any 
given case. This factor would include level of intelligence, history of previous emo- 
tional stability, physical attractiveness, and also an estimate of economic and social 
resources in the environment. For the present, this factor has not been included be- 
cause of the complexity of its quantification. 


THE SCALE


Table 1 presents an outline of the Prognostic Index indicating the criteria which 
are suggested for evaluating various grades of malignancy as measured by the five 
factors. For convenience, the grading has been arranged on a five point scale rang- 
ing from minimal to severe malignancies. This arrangement is patterned after 
standard medical and psychiatric practice. 

The descriptive criteria which have been suggested for each category of severity 
are offered tentatively and in several instances need further refinement. Relating
to the factor of chronicity, two scales are presented since it is recognized that the sig- 
nificance of chronicity is related to age level. The exact values have been chosen 
arbitrarily. 

Two types of quantitative ratings may be assigned to clients after evaluation 
with the rating scale. The first and more qualitative rating consists of a five unit 
figure, such as 23224, reflecting the grade of malignancy on each of the five factors 
rated. Thus the client in this illustration would have a rating of 2 on malignancy of 
symptoms, 3 on trend of disorder, 2 on chronicity, 2 on incapacitation, and 4 on sub- 
jective status. These figures constitute a profile which yields information at a glance 
concerning the factors of most grave prognostic significance. In our experience, in- 
formation of this type is of great clinical value. 

The second type of rating consists of a summation of the grading on each of the
five factors to yield a single over-all score of the degree of malignancy. These
numerical ratings are obtained by simply adding the individual gradings on each factor
and then assigning a rating from Table 2. The higher the total score, the more grave
the prognosis and the more malignant the severity of the case. If it is desired to convert
the scores to a full range of 100 points, this may be accomplished by multiplying
individual or total scores by 20 or 5 respectively.


TABLE 2. NUMERICAL RATINGS OF PROGNOSIS

Rating                 Scores

Minimal
Mild
Moderate
Moderately Severe
Severe
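
The two types of rating described above may be illustrated by a short sketch. It is not part of the scale as published: the function and factor names are hypothetical, and it assumes that each factor is graded from 1 to 5, which the text suggests (a five point scale, the example profile 23224) but does not state outright. Since the score ranges belonging to each verbal rating are not available here, the sketch stops at the total score.

```python
# Hypothetical helper names; grades assumed to run from 1 (minimal) to 5 (severe).
FACTORS = (
    "malignancy_of_symptoms",
    "trend_of_disorder",
    "chronicity",
    "incapacitation",
    "subjective_status",
)


def parse_profile(profile: str) -> dict:
    """Split a five-digit profile such as '23224' into per-factor grades."""
    if len(profile) != len(FACTORS) or any(ch not in "12345" for ch in profile):
        raise ValueError("profile must be five digits, each from 1 to 5")
    return dict(zip(FACTORS, (int(ch) for ch in profile)))


def total_score(profile: str) -> int:
    """Second type of rating: the sum of the five individual gradings."""
    return sum(parse_profile(profile).values())


def gravest_factors(profile: str) -> list:
    """Factors carrying the highest grade, i.e. those of most grave significance."""
    grades = parse_profile(profile)
    worst = max(grades.values())
    return [name for name, grade in grades.items() if grade == worst]
```

For the profile 23224 used in the text, `total_score("23224")` gives 13 and `gravest_factors("23224")` singles out subjective status, the one factor graded 4.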

VALIDATION STUDIES

At present levels of refinement, the prognostic index can be expected to provide 
only a rough indicator of degree of malignancy. Several of the factors rated, includ- 
ing malignancy of symptoms and trend of disorder, require considerable clinical ex- 
perience and the results from different ratings show marked variations in reliability 
and validity. Rather than allow raters to become involved in minute differentiations,
we have requested them to arrive at a rough decision quickly and not to attempt
finer differentiations than on a five point scale. To date we have not yet
accumulated large enough samples of cases in the various clinical categories to
consider statistical significance. This study is therefore regarded only as a preliminary
attempt to establish a pattern. Available results on small samples indicate satisfactory
reliabilities on individual and group judgments.


CLINICAL APPLICATIONS 


In planning experimental designs, it is highly desirable that experimental and 
control groups should be carefully equated for the nature and malignancy of the dis- 
order under study. Unless the nature of the case material is objectified with regard 
to some standard reference points, it will be impossible to draw any valid conclusions. 
It is suggested that the matter of validating and standardizing official indices of prognosis
should be made the responsibility of a committee appointed by the AMERICAN
PSYCHOLOGICAL ASSOCIATION perhaps acting in cooperation with a similar committee
representing the AMERICAN PSYCHIATRIC ASSOCIATION. This is a matter of
the highest importance for the profession as a whole since it involves the establish- 
ment of standards which will need to be applied by all working in the field if the de- 
sired objectives are to be obtained. It is probably beyond the resources of any one 
individual or private group to collect data of the magnitude necessary to effect 
proper standardization and evaluation of comparative results. In the same manner 
as some of the medical specialty groups have established official registries for the 
systematic collection of research data on major problems, it seems desirable for some 
psychological organization to undertake a similar project. Under existing conditions, 
Federal agencies appear to be the only groups with large enough clinical materials 
and research facilities to conduct comparative studies of the effects of various therapeutic
methods on carefully equated clinical groups. The few researches which have
been reported to date on the results of treatment by psychoanalysis or nondirective 
methods have been conducted with so many factors left uncontrolled that their re- 
sults are of slight significance. 
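
One way in which experimental and control groups might be equated on the index is sketched below. The procedure is an assumption made for illustration, not a method given in this paper: clients are ranked by total Prognostic Index score, adjacent clients are paired, and one member of each pair is assigned to each group at random. The function name and the client identifiers are hypothetical.

```python
import random


def equate_groups(totals: dict, seed: int = 0):
    """Pair clients with adjacent total scores and split each pair at random.

    totals maps a client identifier to that client's total index score.
    Returns (experimental, control, unmatched); unmatched holds the one
    client left over when the number of clients is odd.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be repeated
    ranked = sorted(totals, key=lambda cid: (totals[cid], cid))
    experimental, control = [], []
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # chance alone decides which member is treated
        experimental.append(pair[0])
        control.append(pair[1])
    unmatched = ranked[-1:] if len(ranked) % 2 else []
    return experimental, control, unmatched
```

Because each pair consists of neighbors in the ranking, the two groups cannot differ in summed score by more than the sum of the gaps within pairs. Matching on the total score alone ignores the profile; a stricter design might also match on the individual factors.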


SUMMARY 


The Prognostic Index is a rating scale designed for the purpose of measuring 
and quantifying the malignancy of psychological disorders. It attempts to rate five 
factors including the malignancy of symptoms, trend of the disorder, chronicity,
incapacitation and subjective status as reported by the client himself. Ratings may be
expressed either in terms of a numerical profile referring to each of the five factors 
or by a total score representing the summation of factors. It is suggested that wide- 
spread adoption of the prognostic index would result in a much needed refinement in 
equating clinical groups. Successive ratings made on one client with the Prognostic 
Index may also be used to compare the mental status or the efficacy of therapy at 
different points in the history of a psychological disorder. 


Studies of the effectiveness of counseling and psychotherapy have used a variety
of criteria as measures of success. Some researches have employed rating scales of 
differing degrees of elegance, others have relied upon psychological tests, while still 
others have emphasized physiological and environmental measurements. The be- 
havior measured in such studies was sometimes quite specific, as changes in respira- 
tion or pulse rates, and sometimes very broad and elusive such as changes in self 
concept. The therapeutic methods employed ranged from directive through eclectic 
and nondirective procedures. In every case the attempt was made in the approaches 
discussed here to measure the clients’ status before therapy, after therapy, and in 
some studies, during therapy. 

Each approach has certain advantages and disadvantages. The task of the 
present paper is to describe the various approaches currently in use and comment 
briefly upon their limitations and advantages. While different groupings of measures 
before and after therapy may be made, the following six categories are considered to 
include the methods actually in current use: 


RATINGS. The client (or another person who knows him) evaluates his adjustment
either by verbal statements or by using a rating form or questionnaire. The ratings 
are made before therapy has been fully instituted and again at any desired intervals 
during or after therapy. A large number of specific symptoms such as indigestion, 
tremor, tachycardia, palmar sweating, etc. may be covered in the rating form, or 
only broad aspects of behavior may be evaluated, such as feelings of guilt, hostility,
etc. The rating form may vary from a simple list in which symptoms are checked as
present or absent to scales which indicate the amount of improvement or the degree 
to which given symptoms are present. Various combinations of rating forms may be 
used. 

The major virtues of rating techniques lie in their convenience and accessibility. 
They provide a comprehensive estimate of adjustment not obtained by other meth- 
ods. In many clinical situations no other method is feasible. Unfortunately, the re- 
liability of ratings is usually low, and their validity is difficult to establish. Careful 
construction and pretesting of such rating forms can remedy these defects substant- 
ially; however, the necessary effort seems to be put forth but rarely. 

a. Self-ratings by the client or patient, either in the form of oral report or 
printed rating forms, are by far the most commonly employed “before and after” 
methods of evaluating psychotherapeutic effect. Even when research orientation is 
not present, the patient’s statements as to whether he feels he is “getting better” or
“getting worse” are noted and routinely included in the interview records as a kind
of measure of therapeutic progress. If the patient does not volunteer such opinions, 
he is often directly asked how he feels or how he is getting along. In the present 
writer’s opinion such self-ratings are so weak as to be almost worthless. Indeed, 
they can be dangerously misleading at times because the patient or client commonly 
reports that he has been benefited to some extent at least. He reports favorably be- 
cause he must. He is placed in the position of acknowledging either that he was 
helped by the therapy or that he stupidly continued to waste his time and that of 
the counselor without positive effect. Further, as Hathaway has observed, our
culture demands a favorable overt response in such “Hello-Goodbye” situations
regardless of true feelings. A more subtle approach may conceivably be developed
which will surmount these obstacles to the valid use of self-ratings. 

b. Ratings by others, as members among the patient’s family and friends, have
been used occasionally to assess therapy; but they are often awkward to get or not 
feasible professionally. Also, the patient’s fellow-workers, relatives, and others who 
know him well may be so emotionally involved in the problem that they cannot be
objective. 

c. Clinical ratings by competent psychotherapists typically occur in psychiatric
or similar case reports. The patient’s pre-therapy status is compared to his status
after treatment with descriptions as “improved,” “much improved,” etc. Pooled
clinical ratings are often used to decide whether certain treatment should be con- 
tinued or other treatment instituted. Clinical staff sessions usually contain many 
examples of pooled ratings. 

Most studies of psychotherapeutic success, either directly or indirectly, lean 
heavily upon such clinical judgments. Even evaluation studies which purportedly 
include only psychological tests or the patient’s self-ratings are influenced by clinical 
judgments. Such studies commonly use cases where therapy has been formally term- 
inated, and obviously the clinician’s judgment is a deciding factor in the termina- 
tion. Further, if psychological tests are used for the evaluation of therapy, clinical 
ratings are very likely to have been used in the development of the tests themselves. 
Using the MMPI, for example, does not free one from clinical ratings, since the scales 
were originally derived from clinical judgments. 

Because clinical judgments by psychologists and psychiatrists are so pervasive, 
it does not follow that such judgments should be regarded as highly reliable and 
valid. Indeed, there is often evidence to the contrary. For example, Schofield
failed to construct a “susceptibility to therapy” scale for the MMPI chiefly because
a number of patients used in his criterion group were clinically rated as “improved”
and discharged from formal therapy when there had apparently been no such improvement.
He found that a number of discharged, “improved” patients were shortly
re-admitted to the hospital, indicating that the recorded improvement was quite
temporary or non-existent. 

In general, it may be said that the value of clinical ratings is a function of the 
clinician. Some clinicians do a competent job of evaluating adjustment and others 
do not, judging from re-admission rates of “improved” discharged patients.


PSYCHOLOGICAL TEST-RETEST TECHNIQUES. This approach to the problem of evaluation
employs psychological tests administered prior to therapy and again at one
or more intervals afterward. Various tests have been utilized in this way. Muench °® 
Schofield “?, Cowen and Combs’, to name but a few, have used the Rorschach, Bell 
Adjustment Inventory, the Bernreuter Personality Inventory, and the MMPI. Since 
the scoring of most such instruments is objective, there is some advantage in this 
method. However, the objective scores can engender an unwarranted feeling of
confidence because few psychological tests are actually validated for therapeutic changes.
The design of some published researches, for example, makes it impossible to deter- 
mine whether the tests or the therapy was being evaluated. While changes in test- 
retest scores may be highly significant statistically, it is not clear what such changes 
may signify for adjustment nor what their relationship to therapy may be. Does
regression toward the mean indicate better adjustment? If so, test-retest data on
normal, non-therapy groups sometimes reveal better adjustment. Is it not possible 
that a marked rise in score for some characteristic like compulsiveness could indicate 
improvement in certain personalities? Also, may not tests sample too narrow a 
range of behavior in many instances to reflect broad therapeutic change? When 
tests are validated for the therapy situation, we shall have the answers to these and 
similar questions. 

Careful research design can attack such problems successfully, often with no 
greater, or even less, effort than would otherwise be needed. Edwards and Cronbach’s
contribution to the present series of papers offers several excellent suggestions
for design. McQuitty“* © and others have presented some challenging ideas for 
measuring personality integration.


PHYSIOLOGICAL AND ORGANIC MEASURES. Since physical symptoms are often caus- 
ally associated with emotional disturbances, the assumption has been made that 
improved adjustment will be reflected in the disappearance or diminution of the
physical symptoms. Measuring such symptoms before and after therapy should pre- 
sumably yield scores of psychotherapeutic efficacy. Any symptom or syndrome 
could be used, i.e. tics, tremors, perspiration, gastric distress, stuttering, etc.
Thetford © for example, employed measures of galvanic skin response, heart, and
respiration rates before and after nondirective therapy with “frustrated” and control
groups. 

Many physical symptoms are capable of precise measurement by highly sensi- 
tive instruments which may also provide permanent records. To the extent that the 
particular symptoms relate directly to adjustment, such physiological measures pro- 
vide the most objective assessment of therapeutic success available. This very 
virtue, however, is a possible source of literally carnal scientific sin. That is, the 
measured symptoms may be singled out and little attention given to other broad 
aspects of adjustment. Worse, the symptoms alone may be treated. A sedative, for 
example, may reduce the tremor but not the basic conflict. There is a further compli- 
cation in that a given symptom may occasionally persist when general adjustment 
has markedly improved. 

These problems are minor ones and readily avoided or controlled when the ex- 
perimenter is aware of them. While little research on physiological and organic 
“before and after” measures has been published, what has been done indicates that
this area bears bright promise for the future. 


ENVIRONMENTAL AND ACHIEVEMENT CORRELATES. The basic assumption of this
approach is that the better adjustment or better self-understanding resulting from 
therapy will be reflected in the patient’s or client’s achievement. Studies using this 
method have examined such myriad factors as academic grades, salary increases, 
absenteeism, promotions, accident rates, social activity as dating, etc. Super ©?
describes a number of these criteria and relates them neatly to the personal values of 
the individual, an oft-ignored aspect. 

Group or institutional therapy programs frequently employ methods like these. 
The armed forces reports on the number of NP casualties returned to active duty, 
or the Williamson and Bordin ©? study of counseled and non-counseled students are 
examples. Individual case reports sometimes refer incidentally to such correlates as 
evidence of improved adjustment resulting from therapy, i.e. the withdrawn patient 
who acquires social skills like dancing, etc.

The use of environmental or achievement correlates requires a value judgment 
on the part of the investigator, namely, that the observed change validly indicates 
better or poorer adjustment. At times this may be misleading. Lower academic 
grades, for example, may not signify poorer adjustment but merely that “boy has
met girl,” often a healthier symptom than higher grades. If ordinary precautions
are taken, however, such errors are not likely to plague any alert investigator. In 
fact, the chief obstacle of this technique is its execution. Since its application usually 
requires careful follow-up procedures, a great deal of time is needed for obtaining 
accurate information. Even when the necessary time is available certain types of 
data, as salary levels, disciplinary actions, etc., may be exceedingly difficult to secure. 
This is particularly true of negative cases, i.e. where disciplinary action was necess- 
ary; hence a falsely roseate picture may result because no information can be ob- 
tained for negative cases. 


VERBAL BEHAVIOR. This approach postulates that the patient’s speech, what he
says and how he says it, bears a reasonably close relationship to his emotional ad- 
justment. Both clinical and common experience would give support to this assump- 
tion. Anyone who has heard the labored utterances of a depressed patient, the 
neologisms of a schizophrenic, or the staccato imprecations of an enraged truck- 
driver would agree. Sanford * has an excellent summary of speech and personality 
relationships. 

Recorded interviews and their typescripts are usually the raw data for this
method of evaluating therapy. Various analyses have been made. Davis and Rob- 
inson®) and others have studied the verbal content of therapeutic interviews. 
Boder ®), Busemann “), Mann “*) are among those who have used the relationships 
of verbs (active expressions) to adjectives (qualifying expressions) as measures of 
emotional adjustment. With similar intent Grummon “? has studied client negative 
expressions during therapeutic progress, and Fairbanks“? has examined the proportionate
vocabulary of schizophrenics, i.e. the frequency of words like “I,” “me,” etc.
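
The two kinds of index just mentioned, a verb-to-adjective quotient and a proportionate count of first-person words, can be sketched in a few lines. This is only an illustrative sketch: the tagged sample, the tag labels, and the function names are hypothetical, and a real study would need a documented procedure for assigning parts of speech to a typescript.

```python
from collections import Counter

# Hypothetical, hand-tagged tokens from an interview typescript: (word, tag).
# The tags and the sample are assumptions made up for this illustration.
SAMPLE = [
    ("I", "PRON"), ("feel", "VERB"), ("tired", "ADJ"), ("and", "CONJ"),
    ("I", "PRON"), ("worry", "VERB"), ("about", "PREP"), ("me", "PRON"),
    ("and", "CONJ"), ("my", "PRON"), ("small", "ADJ"), ("job", "NOUN"),
]

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def verb_adjective_quotient(tagged):
    """Ratio of verbs (active expressions) to adjectives (qualifying expressions)."""
    counts = Counter(tag for _, tag in tagged)
    if counts["ADJ"] == 0:
        raise ValueError("no adjectives in sample; quotient is undefined")
    return counts["VERB"] / counts["ADJ"]


def first_person_proportion(tagged):
    """Share of all tokens that are first-person words such as 'I' or 'me'."""
    if not tagged:
        raise ValueError("empty sample")
    hits = sum(1 for word, _ in tagged if word.lower() in FIRST_PERSON)
    return hits / len(tagged)


print(verb_adjective_quotient(SAMPLE))
print(first_person_proportion(SAMPLE))
```

Computing the same indices on successive interviews would give a series that could be followed over the course of therapy, though, as noted in the text, its interpretation depends on normative speech data that are still sparse.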

Although not widely done, research on verbal behavior should help much in
evaluating therapy and help even more in understanding the therapeutic process.
One reason for the dearth of studies in this promising field is the heavy cost of com- 
plete interview typescripts and the drudgery of analysis. Another reason is that 
normative speech data are sparse for normal persons in a social situation such as the 
interview. Thus any changes in verbal behavior which may be found during a series 
of therapeutic interviews cannot be measured against normal speech. It is to be 
hoped that time will fill this vacuum. 


EXPERIMENTALLY INDUCED MALADJUSTMENT. This method creates maladjustment
or activates a dormant conflict in the subject. After a period of therapy, tests of 
some kind are made to determine whether the problem still exists. While these tests 
are often of the sort described in the five categories earlier mentioned, this approach 
is so flexible and fraught with possibilities of crucial studies that it deserves a place 
of its own. While most of the research of this type has been done with animals, as 
that of Maier“), there have been several studies in which human subjects were 
used. Keet “!), for example, used students in an ingeniously designed investigation. 
By means of a word association test, he uncovered an area of conflict and then used 
directive and nondirective therapy, followed by tests, in a miniature counseling situa- 
tion. The study by Thetford“, referred to earlier, employed frustrating situations 
and physiological measures of therapeutic effectiveness. 

A major difficulty with experimentally-produced maladjustment is that the 
induced maladjustment may be so minuscule or so transient as to be undeserving of
the name. On the other hand more robust approaches may produce an opposite 
problem, possibly a disastrous one, if a subject is unstable. Perhaps that is why
this technique has been used so seldom. But for the experimenter who can avoid 
both Scylla and Charybdis, the rewards can very well be studies capable of settling
basic points of dispute concerning the efficacy of therapeutic methods. 


SUMMARY 


GENERAL REMARKS. In describing the various “before and after” measures suitable
for evaluating therapeutic success, attention has been deliberately directed to some 
of the snares inherent in each approach which may entrap the unwary investigator. 
Two more general precautions which apply to all approaches should be mentioned. 
These are provisions for control over the mere passage of time (as Hathaway “ has 
noted) and the importance of separating the effects of formal therapy from incidental 
influences which have therapeutic value. 

It should be emphasized in connection with the first precaution that a number 
of maladjusted persons will eventually improve without any formal therapy at all, 
and some others will probably improve in spite of therapy. A non-therapy control 
group is obviously important. A useful additional control would be a provision for 
evaluating completed therapy over long time intervals, years if possible, to measure
the enduring effects of treatment. While difficult, such long-interval follow-up 
studies will be absolutely necessary before adequate judgment on psychotherapeutic 
methods can be passed. The second precaution is essential because clinicians are 
prone to apply the Law of Parsimony in a curious, two-pronged fashion. If the
therapist’s efforts are hindered or halted by an environmental or other outside influence
beyond his control (i.e. the patient’s loss of job or loved one), he recognizes that
formal psychotherapy has been rendered temporarily or even permanently ineffective
through no fault of his own. He is quite aware of such setbacks when he evaluates
his professional activity. Yet a patient under therapy, on the other hand, may
sometimes derive positive therapeutic effect from a salary increase or the departure 
of a mother-in-law, reducing his anxiety to a point where he can cope with it. In 
such cases the clinician is quite likely to attribute the patient’s improvement to his 
own professional competence. Because of this tendency any evaluative research 
should strive to distinguish between formal therapy by the clinician and the inci- 
dental therapeutic effects of outside influences. 
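
The first precaution, control over the mere passage of time, amounts to subtracting the change seen in a non-therapy control group from the change seen in the treated group. The following is a minimal sketch of that arithmetic under stated assumptions: the scores are invented for illustration, the function names are hypothetical, and higher scores are assumed to mean better adjustment.

```python
from statistics import mean


def mean_change(before, after):
    """Average after-minus-before change for one group of paired scores."""
    if len(before) != len(after) or not before:
        raise ValueError("before and after must be non-empty and equally long")
    return mean(a - b for b, a in zip(before, after))


def net_therapy_effect(treated_before, treated_after,
                       control_before, control_after):
    """Change in the treated group beyond what the control group shows."""
    return (mean_change(treated_before, treated_after)
            - mean_change(control_before, control_after))


# Hypothetical adjustment scores (assumed: higher means better adjusted).
treated_before, treated_after = [10, 12, 14], [16, 17, 18]
control_before, control_after = [11, 12, 13], [13, 14, 15]

print(mean_change(treated_before, treated_after))    # raw gain under therapy
print(mean_change(control_before, control_after))    # gain with time alone
print(net_therapy_effect(treated_before, treated_after,
                         control_before, control_after))
```

In this invented example part of the raw gain in the treated group is matched by the untreated group, so only the remainder can be credited to therapy. The sketch does not address the second precaution: incidental outside influences can still be confounded with therapy unless they are recorded for both groups.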

A number of publications provide further information concerning evaluation 
procedures for counseling and psychotherapy. The June, 1950 issue of The Psychological
Service Center Journal is wholly devoted to this topic. The books by Robinson “7) ch. ©),
Wrenn © ch. 17), Hahn and MacLean ©: ch. 2), Blum and Balinsky “ ch. 18),
Brayfield @, part 6) and appropriate sections of The Annual Review of
Psychology all furnish a useful orientation to the problem of assessing psychothera- 
peutic effect. 

One way to evaluate experimental design in research in psychotherapy would
be to consider completed studies. At this point, however, there are next to no 
studies which take advantage of formal experimental design. In the absence of such 
experience, we can obtain considerable guidance from experiments in psychology 
and education where similar complexities arise. 

The problem of research in therapy is essentially one of evaluating the effect of 
a treatment, i.e., ““How did this procedure change this individual?” Comparable 
questions are raised whenever a teacher seeks to evaluate the effectiveness of an edu- 
cational method, and we have a generation of experience of educational research to 
teach us what mistakes to avoid. Psychology, whether studying the effect of food 
intake on maze running, the effect of varying amounts of practice on learning, or the 
effect of a group on an individual’s reference frame, is continually examining the 
consequences of treatments. This cumulated experience with situations, some of
which can be well controlled and easily replicated, is a welcome source of light on
the present problem, for from the point of view of evaluating the effects of treatments,
the problems of research in psychotherapy are fundamentally the same as
those of psychology generally. 

Special problems do arise in psychotherapy, one of which, for example, relates 
to the independent-variable complex. Either it is true that more variables interact 
in individual therapeutic treatment than is customary in experimental psychology, 
or that we are at present unable to specify the independent variables which account
for response to treatment. Thus a study of psychotherapy cannot rule out a host of
disturbing variables in order to concentrate on the significance of one or two. There 
is no prospect of rising to the level of the rat psychologist’s control, where he gets rid 
of a great deal of genetic variability, for example, by drawing his animals from one 
purified strain. One of our concerns will be to state how, if at all, such special prob- 
lems modify the use of formal design in research on therapy. 


II 


It will clarify the issues if we distinguish four kinds of research, for convenience 
designated as technique research, survey research, administrative research, and 
critical research. 

Technique research is simply an attempt to get an instrument or operation for 
gathering data. Berg“? deals with some of the developments leading to measures of
therapeutic process or outcomes, and we need not be concerned with such studies 
here. Such work is preliminary to experimentation. 

Some research can be described as survey research. Survey research often comes 
very early in the development of a given area. At this stage we are relatively ignor- 
ant of possible relationships among our variables and may even lack any knowledge 
concerning the pertinent variables. In survey research we collect a lot of data on a 
lot of variables and then sit down and try to tease out some leads as to what might 
be important. At this stage we may rely on correlation techniques or on even more 
simple statistical methods such as counting the number of cases showing a certain 
behavior. 


*This paper originated in two separate efforts to state whether and how research on psychotherapy 
can profit from formal experimental design. One author prepared a statement of the advantages of 
experimental design; the other author developed, for the same symposium, the position that experi- 
mental designs had strictly limited usefulness. When, after presenting these statements before the 
Midwestern Psychological Association, April 27, 1951, the authors continued their discussion, they 
found themselves in basic agreement on both the positive argument and the warning regarding
limitations and misuse, and therefore have collaborated on this joint treatment.

To illustrate the kind of research for which we use the term survey, we may
point to much of the early work in public opinion. Data were collected on age, sex, 
region of the country, economic status, religious background, political party affil- 
iation, and what have you, and attempts were then made to determine whether 
any of these variables might possibly be related to the responses obtained in opinion 
polls. Comparable work in psychotherapy is seen, for example, when someone studies 
the records of psychoanalytic cases to see what background differences may exist 
between the improved and unimproved patients. Survey research is a necessary pre- 
liminary mapping stage, and not to be disparaged. It is, however, emphatically pre- 
liminary to the more delicate and enlightening analysis an experiment offers. It is 
one thing to find superior socio-economic status correlated with academic achieve- 
ment, quite another and more difficult task to pursue this lead to more basic causes 
of the correspondence. 

A third kind of research problem is administrative or applied research (although 
admittedly survey research is often practical in aim). Applied research is concerned 
with obtaining an answer to one or more specific practical questions upon which, 
usually, some administrative decision is to be based. Countless examples of this sort 
of research may be found in the volume on Mass Communication by Hovland and 
others®). Administrative research often involves experimental design. 

An even greater dependence on experimental design characterizes the fourth 
type of research which, for want of a better name, we call critical research. Critical 
research is research also designed to answer one or more questions. The nature of 
the questions, however, stems from theoretical considerations rather than from 
practical considerations. In critical research, theory should indicate to us the nature 
of the answers to the questions we raise before the experiment is actually conducted. 
The research itself is primarily a check upon the answers to which theory has led us, 
or shows the correct one among several explicit answers possible within the theory. 

Studies of psychotherapy are usually intended to answer questions of the ad- 
ministrative or critical types, either to advise regarding a practical procedure or to 
unfold the causes of personality change. Such studies draw upon technique research 
for methods, and upon survey research for suggestions as to hypotheses. In each 
case, the investigation begins as a variant of the question: “Is Method A better than
Method B?” If the question is stated so baldly, it is inappropriate for research. To
obtain an answerable problem, the criterion must be carefully defined, the methods 
must be specified, and the range of persons and conditions to be considered must be 
identified. As one specifies these variables he makes it clear whether he is doing ad- 
ministrative research on a local problem or more generalized critical research. And 
this in turn influences the experimental design. 

III

The clinical investigator is convinced that his problems are unusually complex, 
and this complexity is usually seen to lie in the large number of variables that may 
be relevant to a given problem. Perhaps some of the difficulties can be formulated in 
terms of three categories of variables: the stimulus variables, response variables, 
and organismic variables. There should be no confusion of meaning over our use of 
stimulus and response—if you prefer substitute situation and behavior. By an organ- 
ismic variable, we mean some property or attribute of the individual. Hair color, 
height, and age are convenient examples. Organismic variables may often be in- 
ferred from observations of previous response or from knowledge of previous exper- 
ience. For example, in a given research problem, intelligence level may be consid- 
ered an organismic variable, despite the fact that it is known from previous observa- 
tions of response. Other examples will come to mind: We can classify individuals as 
Democrats and Republicans, as schizoids and manics, as men or women, and so on. 
Each of these, in a given investigation, may be an organismic variable. There would
be no objection to the substitution of the term person variable for organismic variable. 
(Watson’s “patient variables” “” include variables which our approach forces us to 
divide between the organismic and response categories.) 

Psychological research problems can now be viewed in terms of various per-
mutations and combinations of one or more of these variables. To begin any research, 
the investigator first attempts to consider the variables which may be pertinent, 
including any in which he happens to be interested. He might then indulge in a little 
speculation and theorizing as to the nature of the relationships that exist among 
these variables. And before he begins any research he would, of course, have to face 
the very important problems of observing, recording, and if possible quantifying 
these variables. We should keep in mind, however, that some of the variables in 
which we are interested are not quantitative but instead qualitative, as for example 
when we are interested in two qualitatively different methods of teaching or in two 
methods of therapy. 

At this point we can examine past studies for their lessons to the therapist. The 
first problems to consider relate to the organismic variables. Educators spent a 
generation on studies of the oversimplified “‘Is A better than B?” type. They sought 
to settle by experimentation whether large classes were better than small, lectures 
better than laboratories, frequent tests better than few. Their studies led to endless 
contradiction because, as you will notice, the question did not specify the organismic 
variables. 

Now one way to get past this problem is by delimitation. Suppose our investigator
wishes to know which of two methods of teaching arithmetic is better. If we
ask, we perhaps find that he will be content if he can answer the question as it relates 
to eighth graders. Granted unlimited resources, this might be investigated in a gen- 
eral way, but the question is not worth answering! For every study in education 
which has gathered sufficient data finds that the method which works best for some 
pupils is inferior for others. There is a thorough study by Brownell and Moser “? 
who gathered enough data (1400 cases in 4 communities) to demonstrate what a tiny 
batch of responses usually conceals. They compared (inter alia) a meaningful meth- 
od and a rote method for teaching borrowing in third grade subtraction. On the 
average the meaningful method was better; but in some schools the rote method 
proved superior because pupils had no general understanding of the number system, 
had never learned to reason in arithmetic, and could not follow the meanings pre- 
sented. Only by organizing their design so that cases were classified on this organ- 
ismic variable, i.e. nature of previous work in arithmetic, could Brownell and Moser 
get at the true relation. The relation had to be represented in the original design 
in the form of a stratifying or control variable. Another study by Anderson “°? 
showed that his Method A was better than B for bright, mediocre achievers, but 
that B was better for those of mediocre intelligence but good past achievement. The 
inclusion of both organismic variables in the design was essential if he was not to 
reach an oversimple, hence untrue, conclusion. 
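
The point of classifying cases on an organismic variable can be shown with a small numerical sketch. The records below are invented and are not Brownell and Moser's data. They are arranged so that the overall comparison favors one method while one subgroup shows the reverse. All names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (method, nature of previous work, gain score).
# Invented for illustration; not taken from any of the studies cited.
RECORDS = [
    ("meaningful", "meaningful", 9), ("meaningful", "meaningful", 8),
    ("meaningful", "meaningful", 10),
    ("meaningful", "rote", 3), ("meaningful", "rote", 4),
    ("rote", "meaningful", 5), ("rote", "meaningful", 4),
    ("rote", "meaningful", 6),
    ("rote", "rote", 6), ("rote", "rote", 7),
]


def overall_means(records):
    """Mean gain per method, ignoring the stratifying variable."""
    groups = defaultdict(list)
    for method, _, gain in records:
        groups[method].append(gain)
    return {method: mean(gains) for method, gains in groups.items()}


def cell_means(records):
    """Mean gain per (method, previous work) cell of the design."""
    groups = defaultdict(list)
    for method, previous, gain in records:
        groups[(method, previous)].append(gain)
    return {cell: mean(gains) for cell, gains in groups.items()}


print(overall_means(RECORDS))
for cell, value in sorted(cell_means(RECORDS).items()):
    print(cell, value)
```

Here the pooled means favor the meaningful method, yet among pupils whose previous work was by rote the rote method shows the larger gain. Only the cell means, which require the stratifying variable to be recorded in the original design, reveal this.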

It follows that simple delimitation is often not enough, except for some admin- 
istrative studies where we are content with an actuarial estimate that for the local 
sample one method tends to work better, on the whole, than the other. The proper 
proposal in the face of this difficulty is that cases should be described in terms of the 
largest possible number of organismic variables. Wherever possible these should be 
accurately measured. Second, the most promising (i.e., most likely to be relevant) 
variables should be built into the design so that gains can be assessed separately for
each variable. 


IV 


Let us now consider an experimental design which will permit not only an 
answer to a question concerning the overall effectiveness of two methods of therapy, 
but also to the question of whether there is any differential in response of two groups 
of patients (say, those with a high degree of initial disturbance and those with a low 
degree) to the two methods of therapy. This type of design is called a factorial de- 
sign. A factorial design is one in which we have two or more variables each varied 
in two or more ways and studied in all possible combinations. 

If we use a factorial design for the problem at hand, the discrepancies in response
between the two types of patients, if such discrepancies exist, will show up in a sig- 
nificant interaction. We set up a 2 x 2 table in which the top entries are Methods 
A and B. The side entries correspond to our high and low degrees of emotional dis- 
turbance. In each of the four cells of this table, we may enter the mean score on 
the response variable for a particular set of observations. 


                          METHOD
                        A        B
High disturbance
Low disturbance

[Cell entries of the table are not legible in the source.]


Now a significant interaction might show up in a number of different ways. For 
example, we might find, as our hypothetical entries suggest, that the mean score for 
the more disturbed patients is much the same regardless of the method of therapy. 
For the less disturbed patients, perhaps Method A is the more effective. A different 
significant interaction might be found if, for example, the more disturbed patients 
respond very well to one of the methods, whereas the less disturbed patients respond 
very well to the other method. Various other possibilities might produce a signi- 
ficant interaction also. A factorial design permits a test of significance of any inter- 
actions in the variables investigated, as well as a test of significance of the main 
effects. It is for this reason that factorial designs should prove to be extremely useful 
in research in psychology and psychotherapy and superior to a simpler matched- 
group comparison in which all Method A cases are treated together. 
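
A worked sketch may make the interaction test concrete. The scores below are invented (they are not the entries of the table above). They follow the first pattern described: the more disturbed patients score alike under both methods, while the less disturbed do better under Method A. The function computes the F ratio for the interaction in a balanced 2 x 2 design. The names and data are hypothetical.

```python
from statistics import mean

# Hypothetical response scores, two patients per cell (balanced design).
SCORES = {
    ("high", "A"): [10, 12],
    ("high", "B"): [10, 12],
    ("low", "A"): [19, 21],
    ("low", "B"): [13, 15],
}


def interaction_f(scores):
    """F ratio (1 and 4(n-1) degrees of freedom) for a balanced 2 x 2 design."""
    rows = sorted({r for r, _ in scores})
    cols = sorted({c for _, c in scores})
    if len(rows) != 2 or len(cols) != 2 or len(scores) != 4:
        raise ValueError("expected a complete 2 x 2 layout")
    sizes = {len(v) for v in scores.values()}
    if len(sizes) != 1:
        raise ValueError("design must be balanced (equal cases per cell)")
    n = sizes.pop()
    if n < 2:
        raise ValueError("need at least two cases per cell for an error term")

    cell = {k: mean(v) for k, v in scores.items()}
    row_mean = {r: mean(cell[(r, c)] for c in cols) for r in rows}
    col_mean = {c: mean(cell[(r, c)] for r in rows) for c in cols}
    grand = mean(cell.values())

    # Interaction: what is left of each cell mean after removing main effects.
    ss_inter = n * sum(
        (cell[(r, c)] - row_mean[r] - col_mean[c] + grand) ** 2
        for r in rows for c in cols
    )
    # Error term: variation of individual scores about their own cell mean.
    ss_within = sum((x - cell[k]) ** 2 for k, v in scores.items() for x in v)
    df_within = len(scores) * (n - 1)
    return ss_inter / (ss_within / df_within)


print(interaction_f(SCORES))
```

The resulting ratio would be compared with the tabled value of F for 1 and 4 degrees of freedom. With only two cases per cell the error term is weakly estimated, which is why two per cell is treated in the text as a bare minimum.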

This poses a new problem: how is the clinician to isolate effects among dozens of 
variables, many of them qualitative, when he usually has few cases to start with? 
If n variables are represented in the design, and each of these is represented as a 
dichotomy, then 2 x 2^n cases are required for a complete design. Hence 5 control
variables call for 64 cases, 2 each of 32 specified types, assuming that all main effects 
and higher-order interactions are to be investigated. 
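
The growth in required cases is easy to tabulate. The sketch below assumes, as the text does, that every variable is a dichotomy and that two cases per cell is the minimum. The function name is hypothetical.

```python
def cases_required(n_variables, per_cell=2):
    """Cases needed for a complete factorial design of n dichotomous variables."""
    if n_variables < 0 or per_cell < 1:
        raise ValueError("n_variables must be >= 0 and per_cell >= 1")
    return per_cell * 2 ** n_variables


# Number of variables, number of distinct types (cells), cases required.
for n in range(1, 8):
    print(n, 2 ** n, cases_required(n))
```

Each added dichotomy doubles the number of specified types, so the five-variable design that needs 64 cases becomes a 128-case design with a sixth variable.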

There is some difference in the views of the writers here. Cronbach sees the 
number of relevant variables in the clinical study as likely to be so large that enough 
cases to account for them all will almost never be available. Edwards thinks a few 
well chosen organismic variables will clarify therapeutic conclusions and that in long 
range research the specified types to complete the cells of more complex factorial de- 
signs can be obtained. The writers agree that effort to isolate effects due to organ- 
ismic variables can have only a beneficial effect and that cases should be selected to 
represent as much variation as can be. It is far more valuable to study ten cases, 
two each of five identifiable subtypes, than to study a pool of fifty undescribed and 
undifferentiated people. 

The considerations that apply to organismic variables apply also to situational 
variables. It is well known that diagnoses vary with the hospital, and very likely so 
do the ways the treatments are applied. Educational studies have found it necessary 
to give constant attention to the interaction between the teacher’s feeling about a 
new method and his effectiveness in using it. Surely the therapist is a significant var- 
iable to be used in building the design. Witness Dressel and Matteson ©), who found 
that counselees who interpreted their own tests gained more than those who were 
told what the scores meant—but this result was entirely accounted for by differences 
between counselors rather than by any differences between methods. This could never 
have been known except for care in including several cases per counselor (two, as 
before, being the minimum to consider). 

It seems obvious at this point that simple comparisons of A vs B may often be 
relatively worthless, and that comparisons gain value as the design isolates the 
specific types of persons and situations for which A is superior. The investigator 
whose data are meagre should nonetheless organize them to search for such internal 
effects rather than test merely the overall difference, which, like any actuarial result, 
depends mostly on the composition of the sample. If it is necessary to restrict a
study to a single condition or situation, and to a few subjects, the results have maximum
value when the situation is described in detail and the subjects are selected
to be homogeneous on as many organismic variables as possible. 

V

Designs for research in psychotherapy call for more complex treatment of re- 
sponse variables than we have generally seen. Educational experience is again a case 
in point: studies that “prove” method A is superior to B may give different results if 
gains are evaluated in a different way. The classic idea of experimental design in- 
volves an experimental and control group which are compared on a single variable, 
but both the educator and the clinician are trying to alter a complex configuration 
in many ways. In evaluating guidance some investigator might propose to measure 
how the client feels about his problem after counseling. This is relevant evidence, 
but it is equally important to know if the client has learned new ways of thinking 
about himself that will help him solve later problems. 

The necessity of evaluating change in many dimensions is well shown by Brown- 
ell and Moser“), who found that limited evaluation resulted in prior investigators’ 
recommending what actually turned out to be the poorer of two methods. For years 
the evidence had piled up that people who learned to subtract by borrowing were 
slower than those who learned the so-called additive method. Brownell and Moser 
then taught children in Grade III in many schools by one of four methods: Borrow- 
ing-Rote, Borrowing-Meaningful, Additive-Rote, or Additive-Meaningful. The rote 
method meant that the skill was demonstrated and practiced, where the meaningful 
method put stress on analysis of reasons for the procedures. When they measured 
outcomes in terms of speed of subtraction after training, Additive-Rote was as good 
as either Borrowing method. Additive-Meaningful gave poor results because the 
method could not be grasped by children. When the familiar speed test was supple- 
mented by other measures, however, Borrowing-Meaningful came out well ahead of 
Additive or Borrowing-Rote. This was true for delayed tests to measure retention, 
and for tests requiring transfer of the method to new types of problems like sub- 
tracting fractions. No doubt about it, Borrowing-Meaningful was the method which 
best set the stage for further learning, but this would never have been discovered if 
speed of performance had been the only criterion. A sidelight is that this method 
worked best in schools where previous arithmetic was taught meaningfully; it was 
hard to put over any method to those children by rote because they kept insisting 
on explanations. And the meanings were hard to introduce where all previous work 
had been by rote. Does this suggest that one experience in nondirective therapy 
establishes readiness for more such therapy, but that one directive treatment makes 
response to nondirective less likely? These are the sorts of conditions that must be 
considered in explaining results. Note, though, that Brownell and Moser would 
never have nailed down their conclusions by formal experiment if they had not, from 
thinking, reached these conclusions before doing the experiment. 

Again one solution to the problem posed by multiple response variables is to 
delimit. The investigator is within his rights if he says he is concerned with one and 
only one type of change, leaving evaluation on the next variable to others. Unfort- 
unately some investigators choose a single index without clearly realizing that other 
measures of movement would yield different conclusions. Where the investigator 
does not overgeneralize, the reader may be the one who assumes that the single in- 
dex has demonstrated all-round improvement. There is an even stronger argument 
for multiple measurement. In research on counseling and therapy the costs of 
measurement are trivial compared to other effort in setting up and carrying out the 
investigation. Finding enough cases of a certain type is often so hard that only a 
single study can be performed. Keeping the subjects for a long term, making trans- 
criptions or extensive notes on therapy, and maintaining uniform conditions of treat- 
ment—these are so difficult and expensive that it is foolish to skimp on evaluation 
when a few more hours of testing are feasible. The information gained from an ex- 
periment mounts more or less in proportion to factorial n, where n is the number 
of uncorrelated response variables. By this estimate 5 tests can report 120 times as 
much knowledge as a single test in the same investigation! 

Before examining how to treat multiple outcomes, we should mention another 
point, the importance of accuracy in assessment. When the outcomes are measured 
on a brief and unreliable test, or when subjective judgments of personality introduce 
inaccuracy, errors of measurement tend to obscure true differences. In this event, 
investigators are prone to accept the null hypothesis and not realize that a true differ- 
ence may be concealed by their inadequate technique. Effort to refine measurement 
has the same beneficial effect on the power of an investigation as adding to the num- 
ber of cases; the fanciest and largest study can be no better than the evaluating tools. 
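The equivalence between refining measurement and adding cases can be illustrated with a rough calculation. The sketch below is not taken from the original discussion; it assumes a two-group comparison analyzed by a normal-approximation test, a true difference expressed in units of the true-score standard deviation, and the classical definition of reliability as the ratio of true-score variance to observed-score variance. The function name and the example figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_groups(true_d, n_per_group, reliability=1.0, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test.

    true_d      -- group difference in true-score standard deviation units
    reliability -- true-score variance / observed-score variance (assumed model)
    """
    # Error of measurement inflates the observed spread, so the observed
    # standardized difference shrinks by the square root of the reliability.
    observed_d = true_d * sqrt(reliability)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = observed_d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return (1 - _STD_NORMAL.cdf(z_crit - shift)) + _STD_NORMAL.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Hypothetical case: true difference of half a standard deviation, 40 per group.
    for reliability in (0.5, 0.7, 0.9, 1.0):
        print(reliability, round(power_two_groups(0.5, 40, reliability), 3))
```

Under these assumptions power depends only on the product of reliability and number of cases, so a test of reliability .50 given to 40 cases per group is no more sensitive than a perfectly reliable test given to 20.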

VI 

Suppose an investigator is resolved to assess several response variables; how can 
his design be fitted to this intention? At this point the theory of experimental de- 
sign necessitates treating a study with five assessed outcomes as five separate in- 
vestigations, one for each variable.* Some investigators have tried to keep broad 
measures and yet stay within conventional statistics by pouring their data into a 
single overall index of adjustment. This is not recommended, for such an index blurs 
together the strengths and weaknesses of each method and provides no guide for 
improvement. Experience in predicting teacher success is a case in point. Hundreds 
of studies produced negligible correlations or contradictory results, so long as a global 
rating of success was the criterion. As soon as investigators went to more specific 
criteria which dealt with aspects of the teacher’s performance, they began to get ap- 
preciable validities. A criterion of the teacher’s rapport with pupils is predictable 
at a respectable level, where a mixed criterion lumping intellectual, emotional, and 
administrative contributions is not predictable. In therapy, an overall index is not 
a good criterion if the progress of a patient away from anxiety is concealed by nega- 
tive scores assigned for an increase in expressed aggression. 

The clinical investigator is frequently concerned with configurational changes: 
he wants to know if enhanced self-acceptance and greater obedience to social re- 
quirements go together in the same person, or whether the therapy increases one or 
the other. Either of these patterns could account for significant differences for the 
patients as a group in the two variables treated singly. Here an answer might be 
to define the configuration in advance as a pattern of two variables and measure that 
pattern as a single variable. The trouble is that the number of possibly important 
configurations is large, and that multiple significance tests raise the bogy of inflation 
of probabilities. 

If there is one significance test, a P of .05 will arise 5 times per 100 experiments. 
If there are five such tests in an experiment, such a P arises 25 times in 100 experi- 
ments. The more tests that are made the more likely it is that some differences really 
due to sampling variations will seem to be significant. This requires the investigator 
to become more conservative in his inferences, and so he must now regard with 
suspicion a borderline-significant result which he would ordinarily accept if only 
one test had been given. The writers have separately discussed this problem
elsewhere. In effect, the introduction of more variables requires the investigator 
to use more cases to maintain the same sensitivity. 
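The arithmetic of this inflation can be made explicit. The figure of 25 in 100 is the expected number of spuriously significant results when five tests are run in each of 100 experiments. The chance that a given experiment shows at least one such result is somewhat smaller, about 23 in 100, if the five tests are assumed independent. The sketch below computes both quantities; the independence assumption and the Bonferroni adjustment shown as one conservative remedy are additions for illustration and are not part of the original argument.

```python
def expected_false_positives(n_tests, alpha=0.05, n_experiments=100):
    """Expected count of chance 'significant' results across all experiments."""
    return n_tests * alpha * n_experiments


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive in one experiment.

    Assumes the tests are independent and every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k,
              expected_false_positives(k),
              round(familywise_error_rate(k), 3),
              bonferroni_alpha(k))
```

The stricter per-test level is the formal counterpart of regarding a borderline result with suspicion, and it is the reason more cases are needed to keep the same sensitivity.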

VII 

All the foregoing comments point in the same direction: to take advantage of 
experimental design the investigator must have a clear idea in advance of the in- 
vestigation as to what effects he expects to find. He must have some fairly sound 
insights as to what variables of all three types are relevant so that he can measure 
them and base the design on them. Wherever he ignores a relevant situation or 
organismic variable he increases his error term and hides significant effects. Wherever
he includes a variable which is irrelevant, he unnecessarily inflates his probabilities
and sacrifices the possibility of using the same effort on a more relevant variable. 
There is a paradox in this. The person who understands his phenomenon thoroughly 
can produce the best experimental design, but the person who needs the experimental 
information is the one who does not understand his phenomenon. Thus the critical 
experiment, and even the applied experiment, depend for their effectiveness on the 
adequacy of the technique research and survey research which precedes them. The 
experiment confirms or denies specific hypotheses, but the experimental design does 
not produce hypotheses. An experiment designed to nail down a certain suspected 
effect can, however, be used for exploration and identification of hypotheses for the 
next experiment. 

*Methods permitting treatment of several variables at once are beginning to be developed, a
notable example being the procedure for treatment of patterns proposed by Block, Levine, and
McNemar.

The statistical methods and formal tests of hypotheses are the tool of the 
cautious, tough-minded, hard-to-convince scientist. Every research worker has to 
have two personalities if he is to get the most good from his data. He must be the 
rigorous tester who believes nothing without conclusive evidence, when he is deciding 
what relations are to be admitted as proven facts. At this stage, he relies on signifi- 
cance tests and will not admit the validity of any hypothesis that does not yield a 
significant result. But if he proceeds to dismiss all such hypotheses, to conclude that 
“there is nothing in the idea,” he makes what statisticians call an error of the second 
kind. The naive observer makes “errors of the first kind” by being too believing. 
The person who discards the unproven ideas makes the error of ignoring real relation- 
ships which his experiment is not powerful enough to bring out. So after the tough- 
minded half of the investigator’s personality has accepted what it will from the study, 
he must turn loose the inquiring, speculative, and tender-minded half which is willing 
to entertain doubtful ideas. If this tender-minded soul is gullible, believing in what 
has met no significance test, he will end with a science stuffed with superstitions. 
But if he holds these yet-unproven ideas in the air, as notions which may guide him 
in the next experiment or the treatment of the next patient, he is more likely to be 
correct than the man who casts the idea from his mind as soon as one experiment 
fails to provide significant confirmation. 

A genuine relationship may yield a non-significant difference for several reasons. 
One is that too few cases were used in testing it, so that sampling errors obscured a 
real difference. Second, errors of measurement have a similar effect. Thirdly, even 
when a new technique (say, of therapy) is based on a superior concept, it is likely 
to be used inefficiently in its first trials, so that its advantage over other approaches 
will be obscured by technical faults in its application. These are the factors our 
tender-minded, but not therefore unscientific, investigator bears in mind. He stresses 
that “not statistically significant,” like the Scotch verdict “not proven,” permits 
us to return the hypothesis on trial to the arms of those who love it, rather than at 
once chopping off its head. 
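The first of these reasons, too few cases, lends itself to a rough calculation. The sketch below uses the standard normal-approximation formula for the number of cases per group needed to detect a given standardized difference between two groups. It is an illustration only: the 80 per cent power target, the two-sided .05 level, and the effect sizes are hypothetical choices, and an exact t-test calculation would ask for slightly more cases.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def cases_needed_per_group(true_d, power=0.80, alpha=0.05):
    """Cases per group for a two-sided, two-group z-test (normal approximation).

    true_d -- assumed true difference in standard deviation units
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / true_d) ** 2)


if __name__ == "__main__":
    # Hypothetical small, moderate, and large true differences.
    for d in (0.2, 0.5, 0.8):
        print(d, cases_needed_per_group(d))
```

A study run with far fewer cases than such a figure suggests will usually return "not significant" even when the difference is real, which is the sense in which the verdict should be read as "not proven."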

VIII 

So firmly have some investigators been convinced of the merit of statistical 
control that they regard their work as finished when the formally stated hypothesis 
has been fitted with a statement that P is or is not greater than .05. At the extreme, 
one psychological journal has the official policy of discouraging discussion of in- 
dividual cases or other conditions not treated in the experimental design. But in 
research on psychotherapy, and in clinical problems generally, the time for such 
total reliance on formal analysis has not arrived. Once the significance tests are made 
on the hypotheses stated in advance, the investigator is ready to put his intelligence 
to work extracting new hypotheses from the data. The virtue of statistics, and their 
great defect, is that they are blind and unintelligent. They are therefore impartial 
and are not distracted by the broad scene; so they are marvelous for isolating and 
answering a narrow question. In a field where the significant narrow questions are 
unknown, we need explorers who go in with their eyes wide open, look at everything, 
and narrow their hypotheses in terms of what makes sense. Statistics never have to 
make sense. No amount of statistics could have done anything with the information 
that poured into Darwin’s eyes on the voyage of the Beagle. Only after he had sifted 
it through a process much like clinical integration could biologists ask simple enough 
questions for statistics to be of help. Mathematical genetics is a later development, 
possible only after many important gross questions were cleared up. 

There is the hope of including many of the relevant stimulus and organismic 
variables in the experimental plan, but literally thousands of such variables can be 
recorded in a case history and a transcript of therapy. These data are unused so far 
as the formal design is concerned, and go to waste unless the investigator looks 
through them with his eyes open. If the investigator studies the cases where the 
treatment worked better than usual for others of the same supposed type, he may 
discover a new factor, previously ignored, that facilitates treatment. Conversely, 
when some method works poorly, the question is, “What went wrong?” At this point 
a study of the process of the therapy, or of the subject’s initial status, may show 
where difficulties and resistance came in. If the investigator can put his finger on a 
possible fault in the method, he is ready to do a new study in which the fault is cor- 
rected. Or he may find that conditions not under experimental control piled up in a 
way which handicapped the method under trial. If this happened, it would be im- 
portant to repeat the experiment even if the statistical analysis indicated the method 
was a poor one. (An example of this sort of inspection of relevant data not in the 
original design is found in a study of the effect of thiamine on human learning.
With 37 experimental subjects, there happened to be eight cases of traumatic difficulty
during the study: sinusitis, boils on the writing arm, vomiting, schizoid hallucination.
Only one such interference appeared in the control group. The investigator
properly saw this as a “levelling effect” to consider in interpretation.) 

Some students seem to think it highly questionable to use ‘intuitive’ (i.e.,
intelligent) methods to analyze data once the statistical ritual is completed. Perhaps 
this timidity can be allayed by demonstrating how a topnotch statistician does re- 
search. The problem was one in agriculture, and the investigator no less than R. A. 
Fisher. He had a fine criterion, yield of wheat in bushels per acre. He found that 
after he controlled variety and fertilizer, there was considerable variation from year 
to year. This variation had a slow up-and-down cycle over a seventy-year period. 
Now Fisher set himself on the trail of the residual variation. First he studied wheat 
records from other sections to see if they had the trend; they did not. He considered 
and ruled out rainfall as an explanation. Then he started reading the records of the 
plots and found weeds a possible factor. He considered the nature of each species 
of weed and found that the response of specific weed varieties to rainfall and cultiva- 
tion accounted for much of the cycle. But the large trends were not explained until 
he showed that the upsurge of weeds after 1875 coincided with a school-attendance 
act which removed cheap labor from the fields, and that another cycle coincided with 
the retirement of a superintendent who made weed removal his personal concern. 
Here we see a statistician accounting to his satisfaction for every systematic variation 
in his response variables, even if he has to consider the idiosyncrasies of weed species 
and supervisors to do it. This is the way the clinical investigator should proceed if 
he is to get from the stage where he takes pride in any detectable difference to the 
stage where he can predict in advance the response of most patients to his treatment. 

Now it is highly dangerous, having searched one’s data for configurations and 
trends, to trust one’s conclusions. Fisher could trust his because he had many con- 
firming facts to support his final synthesis, and occasionally a clinician is in the same 
fortunate position. Because there is a huge variety of configurations in such a mass 
of data as the clinical researcher possesses, some plausible explanations will surely 
appear, but not all of them would be confirmed in a subsequent sample of cases. 
Hence what one arrives at by these intelligent explorations is a posteriori hypotheses 
in which the investigator may have faith but which require a new experiment before 
he can have scientific confidence. The experiment gives definitive results on the 
question it is designed to answer, but not on the questions that occur to you after 
you have peeked at the data. 
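The danger can be shown by a small simulation in which there is, by construction, nothing to find. The sketch below generates an outcome and fifty case-history variables that are pure chance, picks the variable most strongly correlated with the outcome, and then checks that same variable in a fresh sample. All names, the sample size of 40, the fifty variables, and the random seed are hypothetical choices made for illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def noise_sample(rng, n_cases, n_variables):
    """An outcome and a set of case-history variables, all unrelated noise."""
    outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
    variables = [[rng.gauss(0, 1) for _ in range(n_cases)]
                 for _ in range(n_variables)]
    return outcome, variables


def dredge_and_replicate(rng, n_cases=40, n_variables=50):
    """Pick the best-looking variable after the fact, then test it afresh."""
    outcome, variables = noise_sample(rng, n_cases, n_variables)
    strengths = [abs(pearson_r(v, outcome)) for v in variables]
    best = max(range(n_variables), key=strengths.__getitem__)
    new_outcome, new_variables = noise_sample(rng, n_cases, n_variables)
    replicated = abs(pearson_r(new_variables[best], new_outcome))
    return strengths[best], replicated


def average_over_trials(n_trials=200, seed=0):
    """Mean |r| of the dredged variable in the original and the new sample."""
    rng = random.Random(seed)
    results = [dredge_and_replicate(rng) for _ in range(n_trials)]
    explored = sum(e for e, _ in results) / n_trials
    replicated = sum(r for _, r in results) / n_trials
    return explored, replicated


if __name__ == "__main__":
    print(average_over_trials())
```

The correlation found by searching is regularly impressive in the sample that suggested it and shrinks toward zero in the new one, which is why the a posteriori hypothesis needs its own experiment.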
IX 


The writers differ to some degree on their view of the place of experimental de- 
sign in therapy. Edwards is inclined to expect great advances to be made by identi- 
fying the major nomothetic organismic, stimulus, and response variables and then 
efficiently assessing therapeutic methods with formally designed experiments based 
on these. Cronbach is inclined to doubt that any limited number of nomothetic var- 
iables will account for much of the variation in response to therapy. We both feel 
that clinical research is now in need of good hypotheses to test, rather than of finer 
tests of hypotheses. But this is a question of timing, for early in a science finding 
hypotheses is primary; this exploratory stage has to be superseded by careful ex- 
perimental confirmation. 

On practical recommendations we are also agreed. Formal design has a major 
contribution to make when one has a good hypothesis. Even when the hypothesis is 
worthless, formal design organizes the data so that intelligent speculation can pro- 
ceed faster. We are further agreed that it is grossly wasteful to close up shop, having 
once confirmed one’s hypothesis, without examining all the additional variation in 
the results which the hypothesis left unexplained! The graduate student doing a 
study is usually very much concerned about his significance tests. If he can report 
that A is better than B (P < .05) he is ready to pack up his degree and go home. In 
fact, since this finding asks no new questions, that is all he can do. If instead, he 
treats his data with more intelligence and looks beyond his statistics to see what the 
case records show, he will come out of the study with more questions than he went 
in with. And the highest function of research is to help us ask better questions in 
our next study. 

By building a properly complex formal design around a limited number of pre- 
sumably important stimulus, organismic, and response variables, the investigator 
may ask a certain number of specific questions regarding their interactions. The 
formal design asks these in a very efficient manner. Then if the records contain add- 
itional facts about other variables, the failure to include them in the design means 
that the investigator has not asked a scientifically answerable question regarding 
them. The exploratory phase which follows the planned significance test weighs 
these added variables to see if any of them, alone or in combination, seem important. 
Thus the exploratory phase of research is trying to find out what questions the next 
experiment should ask. Exploration and intelligent analysis are no substitute, how- 
ever, for that next experiment which gives scientific legitimacy to the relationship 
suspected. 
Research on the evaluation of psychotherapy is a task fraught with many difficulties.
Although any serious investigator would be forced to agree with this statement,
the very recognition of the fact has seemingly had a paralyzing effect. Granted 
that the available tools and concepts are inadequate to capture the full flavor of 
therapeutic effort, investigation has been deterred all out of proportion to the serious- 
ness of the existing state of affairs. Before proceeding very long, present efforts will 
be seen as inadequate and stumbling, but this should not prevent fostering the be- 
ginning which has been made. 

In planning research on the effectiveness of psychotherapy a considerable var- 
iety of defining and delimiting characteristics must be considered. The most ob- 
vious characteristic to be defined is the method whereby a theory of dynamics is 
integrated with appropriate techniques, attitudes and practices of the psychother- 
apist. Among the more common approaches, as these integrations may be called, 
are the psychoanalytic, the relationship, the client-centered, and the psychobiologi- 
cal. These approaches, in turn, become proliferated. For example, the practice of 
psychoanalytic therapy shows many nuances of systematic differences. Varying de- 
grees of unorthodoxy appear in the work of Schilder, Alexander, Sullivan, Fromm- 
Reichmann and Horney, to name but a few practitioners in the United States. Even 
the approaches to psychoanalytic therapy of the various training centers, such as the 
New York, the Washington, and the Chicago Psychoanalytic Institutes, differ 
among themselves despite a considerable core of agreement. When different thera- 
peutic approaches are compared, the problem becomes even more complicated. Differ- 
ences arising from the skill, particular orientation, idiosyncratic theoretical pre- 
dilections, and personality of the individual therapist would also appear to be an- 
other group of relevant variables. In some therapists these individual variations may 
be so important as to overshadow the influence of the avowed systematic approach 
adopted and result in what can only be called a unique approach. The patient and 
his personality characteristics as well as the particular manifestations of symptom 
and structure which bring him to the therapist must also be considered and specified 
in evaluating the effectiveness of psychotherapy. The patient’s background and previous
experiences, which in a specific instance may or may not have contributed to 
his difficulties, form another significant group of defining and delimiting characteristics.
The type of organization supplying therapeutic service must also be considered,
as in the distinctions among state hospital inpatient services, child guidance
clinics, and college counseling services. When mediating adjuncts such as bibliotherapy
or occupational therapy, or when environmental manipulation or modification
are used to supplement direct psychotherapy, measurement of the effectiveness
becomes still more complicated because of the necessity of separating the 
effects of these influences from psychotherapy proper. Somatic intervention, such as 
shock therapy or lobotomy, may also complicate attempts at evaluating psycho- 
therapy. In view of such distinctions it would be futile to mix indiscriminately all 
these ingredients in a crucible and hope to distill only one element—the essence of 
psychotherapy. 

Hereafter, the effectiveness of psychotherapy as such will be considered. As a 
consequence the complicating effects of mediating adjuncts, environmental man- 
ipulation or modification, and somatic treatment will not be further discussed. This 
is justified since they are not crucial to investigation of the effects of psychotherapy 
provided they are not used simultaneously with the patients in question. Eventually 
the differential efficacy of the various therapeutic approaches will assume prominence
as a research problem. Study of the differential effects of the systematic approaches
with special reference to the delimiting variables (syndrome, number of 
patients simultaneously seen in therapy, age, skill of the therapist, type of organization
supplying service, etc.) will also become possible ultimately. Since so little is 
known about methods and designs for measuring the effectiveness of any sort of 
therapy, it would appear most appropriate to deal at present with the relatively 
simpler and logically prior problem of effectiveness within as sharply defined circum- 
stances as our data and acumen permit. 

What kinds of data are needed in evaluating the effectiveness of psychotherapy? 
Keeping in mind the limitations just mentioned, the following classification of the 
variables needed to define such research is suggested. Information is needed about 
the patient himself, his environmental background, the therapeutic situation, and 
the therapist. In other words, data need be obtained on variables (1) intrinsic to the 
individual (patient variables), (2) related both currently and historically to his life 
(situational variables), (3) descriptive of the therapy (therapeutic variables), and 
(4) descriptive of the therapist (therapist variables). The same study may devote 
some thought to more than one of these variables, but no published study has simul- 
taneously considered all of them within the same group of patients. All research in 
this field needs at least specification and identification of these variables for adequate 
interpretation of findings. 
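One way to make this requirement of specification concrete is to keep, for every case, a record with one slot for each of the four classes and to check which classes a study has left blank. The sketch below is only an illustration of such bookkeeping; the class name, the field contents, and the example entries are hypothetical and are not drawn from any study discussed here.

```python
from dataclasses import dataclass, field, fields


@dataclass
class CaseRecord:
    """One case, with a slot for each of the four classes of variables."""
    patient: dict = field(default_factory=dict)      # intrinsic to the individual
    situational: dict = field(default_factory=dict)  # current and historical life setting
    therapeutic: dict = field(default_factory=dict)  # descriptive of the therapy
    therapist: dict = field(default_factory=dict)    # descriptive of the therapist

    def unspecified_classes(self):
        """Names of the classes for which nothing has been recorded."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # Hypothetical entries for a single case.
    record = CaseRecord(
        patient={"age": 30},
        therapeutic={"approach": "client-centered", "sessions": 12},
    )
    print(record.unspecified_classes())
```

A study whose records regularly report empty classes has, in the sense of the paragraph above, not yet specified the conditions under which its findings hold.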

At this point there will be briefly reviewed the four classes of variables—thera- 
peutic, situational, therapist, and patient, in that order—by stating some of the 
methods used in the evaluation of psychotherapy and some of the illustrative prob- 
lems faced. 

It is in the area of the process of therapy as such that most research deserving 
the name has been done. The work of nondirective psychotherapists is noteworthy. 
The studies of Snyder, Seeman, and Raimy are concerned with changes from 
earlier and later therapeutic sessions and are not so designed as to bring out the 
effectiveness of therapy outside the therapeutic situation. With this emphasis they 
supply information about the therapeutic variable as such. Other approaches, such 
as the discomfort-relief quotient and the Movement Scale of Hunt and his associates,
also fall in this category. 

The situational variables have been most extensively studied by social workers. 
Concerned with such problems as broken homes, position in the family constellation, 
parental attitudes and related background factors, they have attempted to tease 
out the influence of these upon therapy and thereby their influence upon its effect- 
iveness. Extra-therapeutic interviews and analysis of therapeutic case notes have 
been the principal sources of data used. Further quantification and objectification 
are needed here as well. 

The orientation of the therapist is a product not only of his professed theoretical 
position, as investigated by Fiedler, but also of his own personality characteristics 
including his biases and blind spots. Psychotherapy is a process conducive to emo- 
tional involvement on the part of both therapist and patient. Accordingly we need 
to know the therapist’s reaction to the patient. How does he view him? Does he like 
him? Is he bored or hostile? Ego-involvement with the patient is sometimes a very 
prominent and pertinent feature against which to evaluate psychotherapeutic 
effectiveness. The phenomena of transference and counter-transference testify to its 
importance in evaluating psychotherapy. When one’s hopes and fears are intimately 
and inextricably bound to the process, as is the case in psychotherapy, it should be 
no occasion for surprise that the usual narrative reports of psychotherapy
often tell us more about the personal and theoretical orientation of the therapist
than they do about the patient. How to face this problem adequately is also 
an issue to be solved. 

Not only will the personality of the therapist affect the very process of therapy, 
but also it will influence the outcomes of the research as such. In many investiga- 
tions of psychotherapy some of the participating therapists are only indirectly or 
peripherally related to the research. Sometimes they are disturbed by their participation,
distrustful of the value of the study, or even negativistic concerning the whole 
project. This varies according to their background, relationship with the research 
workers, and kind and degree of preparation for the project. Although this is by no 
means entirely a matter of research design, it is obvious that some method of speci- 
fication of their attitudes and feelings must be included since they very directly 
affect the results obtained. 

Methods used or potentially open to use in evaluating the orientation and 
personality of the therapist include analysis of the therapeutic interviews (either 
through anecdotes, case notes, or transcribed therapeutic interviews), observation 
of the behavior of the therapist in and out of the therapy sessions, questionnaires, 
Q sort of statements concerning therapeutic relationships, tests, autobiographical 
statements, rating scales, and interviews with the therapist. 

To evaluate the effectiveness of psychotherapy in terms of the patient variable 
we need information concerning the way in which the patient experiences the therapeutic
relationship. As Rogers says, the therapeutic relationship as experienced 
by the patient is a field of inquiry which is new and one for which research is non- 
existent. The nearest approach to research that we have are accounts of how the 
patient felt about therapy when asked directly and inferences made by the clinician 
from transcripts or case notes. How to objectify these materials and yet to deal with 
them meaningfully is still another problem. 

Other aspects of the patient variable have been investigated through a variety 
of techniques. Questionnaires and rating scales filled out by therapists, patients, 
friends of the patient, or others are used. These devices suffer from the well-known 
deficiencies of such structured self-report techniques. When the patient is asked 
about his own tendencies in an undisguised form, all that can be expected is his own 
conscious estimation of what he is like along with what he considers it politic to give. 
In the hands of the therapist, since presumably his training and experience permit 
penetrating beyond mere surface phenomena, these techniques have more promise. 
Nevertheless, difficulties are encountered, as discussed in connection with the therapist 
variable. Interview transcript recordings and case folder summaries form another 
category of data used for analysis of the patient variable. For example, the client- 
centered therapist’s concern with the patient’s capacity for self-initiated con- 
structive behavior is a research interest so far confined to investigations of the very 
materials of therapy itself. The studies of categorization of therapist-patient responses
such as those of Snyder and Raimy, although primarily directed to other 
aspects of psychotherapy, supply information permitting evaluation of effectiveness 
in terms of the patient variable. 

Primarily in connection with the patient variable it is important in research on 
psychotherapy to preserve as normal a clinic routine as is possible. If the patient 
realizes he is a subject of research the situation changes probably in a subtle but 
nonetheless important fashion. How to preserve normal clinic routine and yet to 
do the intensive and extensive research is still another problem worthy of attention. 

In evaluation of psychotherapy the patient and his personality characteristics 
loom large. It is in the area of diagnostic appraisal that the clinical psychologist 
has made his most extensive professional contribution. He is in a position to supply 
much information about the past experience, present functioning, and future develop- 
ment of the individual patient. It is in the grouping of patients for purposes of 
specification of common characteristics that the greatest difficulty is encountered. 
In other words, defining the sample, whether by diagnostic label or other means, is 
indeed difficult. Diagnostic nomenclature, so far as grouping like with like is con- 
cerned, is in a rather sorry state with the textbook cases hardly ever appearing in 
the clinics. So even the specification of the nature of the patient group is by no means 
an easy task. This, too, is a problem which plagues research workers. 

In considering the patient variable, the criteria against which to measure the 
effectiveness of psychotherapy can hardly be an unexpected problem to offer for dis- 
cussion. In the course of reading in the literature, upwards of a hundred criteria, 
used singly and in combination, have been found. Among them are the disappearance
of symptomatology; reorganization of personality structure; vocational, avocational,
family and individual “adjustment”; restoration of memory of the primal 
scene; testimony of the patient as to his satisfaction with results; ability to leave the 
hospital; deeper self-understanding; and so on ad nauseam. Selection of a single criterion
or of a limited number of criteria is easy but may baldly or subtly reflect little 
more than the bias of the investigator. The brutal fact is that we have no common 
agreement as to what are the crucial criteria of effectiveness of psychotherapy. All 
of those here mentioned may represent facets of the valid criterion—or they may not. 

The ways in which these and other criteria are employed for specifying the nature
and degree of change in the patient run the gamut of the methods and theories held 
by the clinician. Therapist’s opinion, narrative statement, analysis of transcripts 
and of case notes, ratings by both patient and therapist, tests, questionnaires, 
physiological measures, post-therapy interviews of the patient or others, have all 
been used. 

Too often the results of research studies, especially in the psychiatric literature, 
are stated in terms of the therapist’s opinion in such verbal categories as “recovered,”
“improved,” “no change,” and “worse,” with little or nothing in the way of
operational specification. And yet, these opinions of experts are important. The
need is to sharpen such opinions and make them precise and replicable.

Test results are an important and commonly used device. In connection with 
psychological tests, we have perhaps overemphasized the development and use of 
cross-sectional devices which tap many varied and interrelated aspects of personality 
structure and function at the expense of the development of instruments which 
faithfully reflect change in the individual. Stability of test scores is all very well, 
unless we have to pay the price of lack of sensitivity to actual changes in the person. 
Tests should be validated not only by showing that they are related to the criterion at
one point in time, but also by showing that, when varying circumstances make the
person different, the test results keep pace.

Stressed up to this point have been the objectification and manipulation of re- 
search data on effectiveness of psychotherapy in terms of four groups of variables. 
Actually the most important problem is the reconciliation of the potentially avail- 
able objective material with the fluid dynamic relationship which is therapy in 
operation. 

As in other aspects of clinical research one may identify two extreme positions 
taken by those concerned with therapy. At one extreme of the continuum there are 
individuals who demand scientific precision at any cost; at the other there are those 
who would say that psychotherapy with a given patient is so unique, personal and 
artistic a process as to make attempts at measurement futile and absurd. This 
skepticism is brought about not so much because objective research leads to in- 
correct findings, but because the findings are considered trivial and often mere veri- 
fications of phenomena already established by clinical experience. If the results appear
to contradict such experience, these skeptics have no hesitation in rejecting them. Although
most individuals do not take either extreme position, it would probably be
fair to say that more psychotherapists, especially psychiatrists and social workers, 
hold to an artistic rather than to a scientific view of the matter. Part of the antip- 
athy to research in psychotherapy that they demonstrate may be attributed to their 
lack of training in research methodology. Nevertheless, it is suspected that this by 
no means accounts for it entirely. After all, their aggregate experience in the practice 
of psychotherapy is infinitely greater than that of the research-minded. Even some 
psychologists trained in research share their skepticism. 

Whatever position one may now take, ultimately reconciliation and integration 
must be achieved. Research efforts should take advantage of the insights and posi- 
tive values of both approaches and strive for the development of techniques which 
will facilitate this reconciliation. Personal bias will become even more clear when it 
is stated that the crucial problem is seen as the objectification of data obtained in
the course of therapy without sacrificing their dynamic character. In connection
with narrative materials new measures are desperately needed which preserve the 
spirit and flow of the therapeutic relationship without exclusive dependence upon 
the unique skill of a particular clinician. Narrative case notes and even transcripts 
are seen as valuable by those who object to the thought of objective research in 
psychotherapy. Perhaps here is the point at which reconciliation and integration 
may come about. 

It would appear that we shall need methods of analysis on the same patients 
of narrative case notes and transcripts on the one hand, and tests, rating scales, 
physiological indices, etc., on the other. How can we analyse both realms of data 
on the same patients at the same time? If we cannot do so, many of our objective
results will be ignored by many practicing psychotherapists. Furthermore these 
qualitative data will continue to present a legitimate challenge to the validity of the 
findings, not because the results obtained by objective means are necessarily untrue 
but because they are woefully incomplete. The intertwined multi-faceted subjective 
nature of man’s experience in psychotherapy cannot be neglected if research on the
effectiveness of psychotherapy is to reach its full stature.

INTRODUCTION


For some time the authors and their colleagues have been engaged in a research 
program aimed at investigating the potentialities of abbreviated individual intelli- 
gence tests with particular reference to the problems of neuropsychiatric screen- 
ing @- 4,5, 6, 7, 8,9, 10, 11, 12, 13,14) Our tests have been designed to meet two criteria— 
that they should serve within the limits of their brevity as adequate measures of 
intelligence, and that they also should serve as rough diagnostic screens to indicate 
those individuals in need of further, intensive examination for possible psychopatho- 
sis. The majority of tests investigated were verbal in nature, but several non-verbal 
tests were tried © !, While the performance of these non-verbal tests was in general 
adequate, we decided to attempt the development of a new non-verbal test better 
adapted to our particular needs which we felt were somewhat different from the usual 
ones governing the development of non-verbal tests. 

Ordinarily the non-verbal test is viewed as a supplement or complement to the 
verbal one and is used to extend the range of manifestations of intelligence being 
measured and hence to offer a fuller, more comprehensive measure. For purposes of 
military screening we desired a non-verbal test to substitute for, rather than to 
complement, verbal tests. It was intended for use in cases of cultural, educational 
and language handicap where a verbal test would not be suitable but where the 
measure obtained by the non-verbal test should be comparable, if possible, to that 
obtained on other recruits with a verbal test. This requires as high agreement as 
possible with verbal criteria, in our case the CVS abbreviated individual intelligence 
scale: 7,1) and the Navy General Classification Test (GCT). We also desired the 
test to be sensitive to psychopathology both when it was the sole test administered 
and when it was administered along with a vocabulary test to afford a conventional 
“scatter” measure. A further motive was our feeling that by experimenting with a 
new type of test we might learn more about the basic problem of differential diag- 
nostic performance and the underlying pathological factors contributing to it. 


THE TEST


After careful consideration, we decided that a non-verbal test combining space 
perception and the eduction of logical relationships promised most for our purposes. 
To these we wished to add recent memory as a factor, since it is known to be sensi- 
tive to many types of psychopathosis, particularly those of organic origin. The type 
of item finally decided upon for the Navy-Northwestern Matrices Test (NNMT) 
consists of a series of three stimulus cards each presented separately and successively, 
and withdrawn after a two second interval. Each of the three stimulus cards con- 
tains on it one part of a symmetrical, four part design. The subject is then shown a 
multiple choice strip (response card) containing five alternatives, one of which is the 
correct fourth segment necessary to complete the design, and he is asked to point to 
the right response. In using a multiple choice technique we felt that by controlling 
the types of error offered the subject we might increase the discriminative ability of 
the test by a differential evaluation of the various errors involved. Each response 
card, therefore, contains in addition to the correct segment needed to complete the 
design, four erroneous alternatives as follows: (1) a design segment correct in form 


1This study is part of a larger project subsidized by the Office of Naval Research under their policy
of encouraging basic research. The opinions expressed, however, are those of the individual authors 
and do not represent the opinions or policy of the Naval service. 

*Currently with the Personnel Research Laboratory, HRRC, Lackland Air Force Base, San An- 
tonio, Texas.

but incorrectly placed spatially; (2) a design segment incorrect in form but correctly
placed spatially; (3) a segment which duplicates in form and placement one of the 
three stimulus segments previously shown; and (4) a segment which is completely 
irrelevant to the total design. All five alternatives are placed on the response card in 
random order, but each response card contains all four types of error. We had hoped 
that the first two, since they isolate symbol learning and spatial orientation, might 
show some differential relationship to intelligence; and that the third, involving per- 
severation through the choice of a previously exposed figure, and the fourth, involv- 
ing complete irrelevance to the task at hand, might offer diagnostic possibilities in 
pathological cases. While our original hopes have not been completely fulfilled, our
results support the value of such a controlled-error, multiple-choice technique. 
Figure 1 shows a typical item from the final form of the NNMT. The figures are
printed on cardboard strips 2¼” x 11¼” which are bound together by a plastic ring
binder, making it possible to expose the cards by turning the pages. Since the printing
is on one side of the strip only, as a strip is turned the previous design segment
disappears as the next one appears. The three stimulus strips for the design each
contain a squared space 2¼” x 2¼” (the stimulus “card”) containing one element
of the design. The response strip contains five squares or “cards” side by side showing
the correct response and the four erroneous alternatives. In the item shown in
Figure 1 the first strip contains a square or “card” having a half-moon (facing right)
in the left-hand portion of the square. The second strip shows a “card” containing a
triangular figure in the upper middle portion of the square. The third strip shows a
“card” with a half-moon (facing left) in the right-hand portion of the square. The
response card contains a square with a circle on it (irrelevant error); a square with a
half-moon facing right in the left-hand portion (perseverative error); a square with a
triangular figure in the lower middle portion (correct response); a square with an
incorrect figure in the lower middle portion (incorrect figure spatially correct); and
a square with a triangular figure in the upper middle portion (correct figure spatially
incorrect). All the other items are constructed on the same principle.

Figure 1. Sample Item from NNMT.
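
The structure of an item can be set down compactly in modern terms. The following Python sketch is only an illustration: the class and function names are hypothetical, and the left-to-right order of the five alternatives is assumed to follow the order in which the text lists them for the Figure 1 item.

```python
from dataclasses import dataclass
from enum import Enum


class Choice(Enum):
    CORRECT = "correct form, correct position"
    RIGHT_FORM_WRONG_PLACE = "correct form, wrong position"     # error type 1
    WRONG_FORM_RIGHT_PLACE = "wrong form, correct position"     # error type 2
    PERSEVERATIVE = "repeats a stimulus segment already shown"  # error type 3
    IRRELEVANT = "unrelated to the design"                      # error type 4


@dataclass(frozen=True)
class Item:
    stimuli: tuple        # three (form, position) pairs, shown one at a time
    response_card: tuple  # five Choice values, assumed left-to-right order

    def classify(self, pointed_at: int) -> Choice:
        """Return the kind of alternative the subject pointed to (0-based)."""
        return self.response_card[pointed_at]


# The Figure 1 item as described in the text (card order is an assumption).
figure_1 = Item(
    stimuli=(("half-moon facing right", "left"),
             ("triangle", "upper middle"),
             ("half-moon facing left", "right")),
    response_card=(Choice.IRRELEVANT, Choice.PERSEVERATIVE, Choice.CORRECT,
                   Choice.WRONG_FORM_RIGHT_PLACE, Choice.RIGHT_FORM_WRONG_PLACE),
)


def score(items, responses):
    """Return (number correct, tally of error types) for one subject."""
    tally = {c: 0 for c in Choice if c is not Choice.CORRECT}
    correct = 0
    for item, response in zip(items, responses):
        kind = item.classify(response)
        if kind is Choice.CORRECT:
            correct += 1
        else:
            tally[kind] += 1
    return correct, tally
```

Recording the type of each wrong choice, and not merely the number right, is what later permits the error analysis.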


METHOD OF PRESENTATION


It was possible to present the stimulus segments either in a “logical” order, i.e.,
right to left, clockwise, top to bottom, etc., or in a “mixed up” order in which none
of these progressive relationships were observed. Because we thought the order of
presentation might affect the difficulty or discriminative ability of the item, we in- 
vestigated both types of presentation. Since there were no clear differences in diffi- 
culty between the two methods (although individual items did vary), we have finally 
included items in which the design develops in both fashions—in a logical order, and 
in a mixed-up order. 

While our validation and standardization groups are perhaps sufficiently large 
to merit circulation of the test at this time, we still consider it to be in an experi- 
mental stage of development. Our experience of the past five years has convinced us 
that too often tests tend to be circulated for use somewhat prematurely, before the 
test constructors are fully acquainted with the potentialities as well as the peculiar- 
ities of their test. Since our results to date, however, are quite interesting and raise 
many suggestive and stimulating points in connection with diagnostic test develop- 
ment, we feel that a preliminary report is justified. 

The test is designed for individual administration. There is no time limit for the 
responses and the subject is expected and encouraged to respond to each item so that 
types of error can be noted as well as the total score of correct responses. Great care 
is taken to see that the subject understands the task before starting the actual test. 
The score is the number of correct answers. Two preliminary sample items are gone 
over with the subject before the fifteen “test” items are begun. The following
instructions are used:

“In this test you are to finish a design. First I will show you, one at a time,
three parts of a design. The design needs a fourth part to finish it. Then I will show
you a card from which you can pick the missing part. Let’s try this sample.” Show
first sample. “Here is the first part, the second, the third. Now see if you can find
the fourth part from this card.” Pause to allow subject to choose. If he selects the
right one, say “Yes, that is right because it makes the fourth corner of the design.”
If he fails, say “No, it couldn’t be that one because: ‘It is in the wrong place’, ‘It
faces the wrong way’,” etc. If he fails to understand, show him the stimulus cards
again and show him how the correct answer fits. “Now let’s try another sample.”
Show second sample. Repeat as with first. Say, “Yes, that’s right because it fills in
the bottom half of the design between the two crescents”, or “No, it couldn’t be that
one because...” If he fails to understand, do as with the first sample. “Now do you
know what you are to do? If you aren’t sure, ask me now because I can’t answer
any questions after we start.” Pause for questions. “Ready for No. 1?”

Originally we started with thirty items. At this time the “strip” presentation 
represented in Figure 1 was not in use. Instead each stimulus square was presented 
as a separate 2¼” x 2¼” card rather than as a 2¼” x 2¼” square marked off on a
2¼” x 11¼” strip. (As we shall see later, the results are the same in either case.)
Preliminary investigation with 562 college undergraduates tested in several large
groups with the aid of a projection lantern, and 86 graduate students and 201 Naval
recruits tested individually revealed that the test did not discriminate satisfactorily 
within the superior college population where the distribution curve of the scores 
showed extreme negative skewing, but did discriminate well among the recruit 
group. With the recruit group a normal distribution covering the entire range of 
possible scores was obtained as were satisfactory correlation coefficients between the 
test and the Navy GCT and the CVS battery. An item analysis was then done on 
the recruit data using both the GCT and the CVS as criteria. The method of Davis“? 
was applied for this analysis and items with a discrimination index falling below his 
suggested minimum of 21 when either of the criteria was used were automatically 
rejected. The remaining items were evaluated in terms of their agreement with both 
criteria and their level of difficulty, and the best 15 were selected for the final form of 
the test. Cross validation and standardization of this final form were carried out on
two new samples totaling 445 Naval recruits from the Great Lakes Naval Training
Center. Each recruit was tested individually.
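
The rejection rule described above (an item is dropped if its discrimination index falls below 21 against either criterion) can be sketched as follows. The index values are invented for illustration and are not the study's data; ranking the survivors by their mean index is an assumed stand-in for the authors' joint judgment of agreement with both criteria and of difficulty.

```python
MIN_INDEX = 21  # Davis's suggested minimum, as quoted in the text


def select_items(indices, n_keep=15, min_index=MIN_INDEX):
    """indices maps item id -> (index against GCT, index against CVS).

    Items below the minimum on either criterion are rejected outright.
    The survivors are ranked by mean index (an assumed ranking rule)
    and the best n_keep are returned in item-number order.
    """
    passed = {k: v for k, v in indices.items() if min(v) >= min_index}
    ranked = sorted(passed, key=lambda k: sum(passed[k]) / 2, reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical indices for five trial items.
trial = {1: (30, 44), 2: (19, 29), 3: (43, 50), 4: (25, 20), 5: (22, 21)}
kept = select_items(trial, n_keep=2)
```

Here items 2 and 4 are rejected because each falls below 21 on one criterion, even though each is acceptable on the other.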
VALIDATION


The first standardization group of recruits numbered 337. They ranged in age 
from 17 to 26 with a mean of 17.84. They had all been given the GCT and their 
scores ranged from 26 to 74 with a mean of 54.19 and a standard deviation of 8.26. 
The CVS abbreviated individual scale also was administered at the time of presenta- 
tion of the NNMT and the CVS scores ranged from 22 to 46. The mean was 30.58 
and the standard deviation was 5.58. The NNMT results on this first group are presented
in the first part of Table 1. The range of scores was from 0 to 15 with the distribution
being approximately normal. The mean score was 8.29 and the standard
deviation 3.79. The NNMT correlated .64 with the GCT and .64 with CVS. Keeping
in mind the greater stability as well as greater diagnostic possibilities of a battery
of tests, we also obtained the combined scores of Vocabulary (from CVS) and NNMT
and of CVS and NNMT (using standard scores for NNMT derived by Wechsler’s
procedure as we had previously done with CVS), and correlated these with the GCT
scores. The results also appear in Table 1. The V-NNMT and CVS-NNMT means
were respectively 19.72 and 40.37 with standard deviations of 4.9 and 7.8. V-NNMT
correlated .77 with GCT and CVS-NNMT correlated .83. For comparison, the correlation
of CVS with GCT for this group was .70. In view of these findings we concluded
that the NNMT performed satisfactorily when added to V or CVS to constitute
a battery.

TABLE 1. VALIDATIONAL DATA FOR NNMT


ITEM ANALYSIS


After establishing the potential usefulness of the test as a whole, we re-examined 
the individual items. As before, item analyses were done with both GCT and CVS 
as criteria. The results appear in Table 2. Discrimination indices ranged from 11 to 
48 when GCT was the criterion and from 20 to 53 when CVS was the criterion. Item 
2 fell below the minimum requirements when GCT was the criterion, but it was 
satisfactory with CVS. The mean discrimination index is 30.05 with GCT and 34.87 
with CVS. To account for the fact that the indices vary when the correlations be- 
tween the total test score and the two criteria are the same, it must be remembered 
that the indices are based on the top 27 per cent and bottom 27 per cent of the cases 
while the r is based on the total sample. Some variation might be expected. 
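
The point about the 27 per cent groups can be made concrete. The sketch below forms the upper and lower groups on a criterion and compares the proportion passing an item in each. It is not Davis's index itself, which is read from his tables; the simple difference in per cent passing is shown only as an assumed stand-in, and all names are hypothetical.

```python
def extreme_groups(criterion_scores, fraction=0.27):
    """Return (low_ids, high_ids): the bottom and top `fraction` of subjects."""
    ranked = sorted(criterion_scores, key=criterion_scores.get)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


def proportion_passing(ids, passed_item):
    """Proportion of the given subjects who answered the item correctly."""
    return sum(passed_item[i] for i in ids) / len(ids)


# Hypothetical data: 100 subjects whose criterion score equals their id,
# and an item passed only by those scoring 50 or above.
criterion = {i: i for i in range(100)}
passed = {i: i >= 50 for i in range(100)}

low, high = extreme_groups(criterion)
difference = 100 * (proportion_passing(high, passed) - proportion_passing(low, passed))
```

Because the middle 46 per cent of cases never enter such an index while r uses every case, two criteria with equal total-score correlations can still yield different item indices.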
TABLE 2. ITEM ANALYSIS DATA FOR NNMT

           Discrimination Index GCT        Discrimination Index CVS
         Group 1   Group 2               Group 1   Group 2
Item     (N=337)   (N=108)    Total      (N=337)   (N=108)    Total

  1         ?         30        34          44        42        45
  2         ?         26        19          29        28        28
  3         ?         43        43          50        53        51
  4         ?         29        30          29        26        31
  5         ?         27        23          37        39        37
  6         ?         40        48          28        30        28
  7         ?         34        31          53        46        47
  8         ?         26        29          36        34        34
  9         ?         34        30          36        42        37
 10         ?         23        35          35        36        36
 11         ?         15        26           ?        25        26
 12         ?         29        28           ?        19        28
 13         ?         19        21           ?        36        39
 14         ?         45        40           ?        43        40
 15         ?         37        33           ?        28        29
Mean      30.05     29.80     31.33       34.87     35.13     35.73
SD          ?        8.71       ?           ?        8.91      7.42

? = entry illegible in the source copy.

The order of difficulty for the items was then determined. In order to test the
stability of this order, several successive samples of 50 each were selected randomly 
and also checked. The difficulty order for each sample was determined and the valid- 
ity of rank order coefficient obtained following Eysenck’s suggested method ©. This 
coefficient proved to be .97. The order is given in Table 3. It will be apparent immediately
that the items were not arranged in order of difficulty. No rearrangement
was made for several reasons. We were unable to predict what influence such a 
change would have on the validity or difficulty of the item. Also, since all items 
should be administered to all subjects if full use is to be made of the test, it would not 
necessarily be bad for the order of difficulty to vary and this might even be of value 
in preventing the discouragement incurred by a succession of difficult items. 

Thus far in our investigation we had used hand-drawn items. It now appeared 
that a more permanent form of the test was justified and copies were printed and 
bound using the “strip” format previously mentioned and illustrated in Figure 1. 
In order to make sure that this change caused no essential change in the perform- 
ance of the test, a new sample of 108 recruits was tested. The results appear in the
second part of Table 1. The GCT mean of 54.40 and the CVS mean of 30.97 are 
comparable to those reported for the earlier sample. The NNMT mean of 8.38 is 
also not significantly different from the 8.29 previously reported. Finally, the two 
battery means of 41.86 for CVS-NNMT and 20.63 for V-NNMT do not differ be- 
yond chance expectancy from the earlier ones. Correlations between NNMT and 
CVS (.48) and NNMT and GCT (.58) are slightly lower, particularly the former.
However, some attrition is to be expected with successive sampling and the correla-
tions were still considered satisfactory. The correlations between the two batteries 
and the GCT remained approximately the same, .78 and .76 for CVS-NNMT and 
V-NNMT respectively. 

The data on item validity for this sample also are presented in Table 2. With 
GCT as the criterion, the range is from 15 to 45 with a mean of 29.80. This is com- 
parable to the previous sample. Comparing item for item we find reasonable stabil- 
ity except for item 11 which falls below the acceptable level in this sample. The mean 
index for CVS as criterion is 35.13 and again we find item for item comparison satis- 
factory, this time including item 11. 

The order of item difficulty is given in Table 3. The rank order correlation be- 
tween the two samples is .63. Closer inspection reveals that here again item 11 has 
shifted positions. A rank order coefficient obtained with item 11 omitted is .79. 
The foregoing results indicate not only that the printed form was comparable to the 
preliminary hand-drawn form, but that the test demonstrated satisfactory stability 
over two samplings. The data were then combined to give a total standardization 
group numbering 445. The results for this entire group appear in the final parts of 
Tables 1, 2, and 3. 
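
The effect of a single displaced item on a rank-order coefficient, as with item 11 above, can be illustrated with a short sketch. The per cent passing figures are invented for six hypothetical items, and the coefficient is computed as the ordinary product-moment correlation between ranks, which is assumed here to be the rank-order coefficient the text refers to.

```python
def ranks(values):
    """Rank from 1 (smallest) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rank_order_r(x, y):
    """Rank-order correlation between two sets of item difficulties."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per cent passing for six items in two samples; the third
# item shifts sharply in the second sample.
sample_1 = [90, 80, 70, 60, 50, 40]
sample_2 = [88, 79, 30, 61, 52, 41]

with_all = rank_order_r(sample_1, sample_2)
without_third = rank_order_r(sample_1[:2] + sample_1[3:], sample_2[:2] + sample_2[3:])
```

With so few items, one shifted item lowers the coefficient considerably, and omitting it restores agreement.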

SCORING 

The standard scores for this final form of the NNMT, which were based on the 
entire group of 445, and which were used in combining the NNMT with the other 
tests to obtain a battery, are presented in Table 4. 

TABLE 4. STANDARD SCORES FOR NNMT

                 Standard Score
Raw Score        Equivalent
M=8.32           M=9.80
SD=3.81          SD=2.94
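
A conversion of raw scores to standard scores might be sketched as below. This is an assumption and not the authors' table: it uses a simple linear transformation to a scale with mean 10 and standard deviation 3, and takes the raw mean and standard deviation from Table 4. The reported standard-score mean of 9.80 and SD of 2.94 show that the actual table departs somewhat from such a formula.

```python
RAW_MEAN, RAW_SD = 8.32, 3.81        # raw-score statistics from Table 4
TARGET_MEAN, TARGET_SD = 10.0, 3.0   # assumed standard-score scale


def standard_score(raw):
    """Linear conversion of a raw NNMT score to the assumed standard scale."""
    z = (raw - RAW_MEAN) / RAW_SD
    return round(TARGET_MEAN + TARGET_SD * z)


# Conversion table for every possible raw score, 0 through 15.
conversion = {raw: standard_score(raw) for raw in range(16)}
```

Putting the NNMT on the same scale as the verbal tests is what allows the scores to be added into a battery and compared for scatter.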





RESULTS WITH CLINICAL GROUPS 

We then extended our sampling to include a wider age range. Both the NNMT 
and the CVS were administered to 104 individuals between the ages of 20 and 55. 
Of these, 65 were firemen with an age range from 22 to 55 and a mean of 42.68 and 
standard deviation of 8.52. The other 39 were women, some of them arthritic pat- 
ients and the rest non-arthritics matched with them for age and marital status. 
Since there was no difference on the NNMT between them, these female groups are 
reported as one. The age range was from 20 to 43 with a mean of 32.26 and standard 
deviation of 7.93. Both these male and female groups were presumably normal 
without any obvious indication of functional or organic mental disease.’ 


’Thanks are due Mr. Joseph Matarazzo for the data on the firemen and Dr. Harold Klehr for the 
data on the arthritic patients and their controls.

TABLE 5. NNMT DATA FOR OLDER SUBJECTS

                       Firemen    Female     Total
                       (N=65)     (N=39)     (N=104)
Age      Range          22-55      20-43      20-55
         Mean           42.68      32.26      38.89
         SD              8.52       7.93        ?
CVS      Range            ?        29-45        ?
         Mean           35.57      34.23        ?
         SD               ?         5.18        ?
NNMT     Range            ?         0-14        ?
         Mean            5.75       6.08        ?
         SD               ?          ?          ?
Correlations
  CVS-NNMT               .40        .34         ?
  Age-NNMT              -.54       -.45       -.48
  Age-CVS                 ?          ?          ?
  CVS-NNMT (age held constant)                 .49
  NNMT-Age (CVS held constant)                -.51
  CVS-Age (NNMT held constant)                 .19

? = entry illegible in the source copy.

The data are reported in Table 5. The mean CVS scores for these older male and
female subjects were 35.57 and 34.23 respectively, as compared with a CVS mean
of 30.8 for the younger recruits. Despite the higher CVS scores, the mean scores of
the NNMT were 5.75 and 6.08 respectively, both lower than the mean score of 8.32 
obtained on the NNMT by the recruits. The correlations between CVS and NNMT, 
.40 and .34 respectively, were also lower than those previously reported with recruit 
groups. It would appear that the NNMT is adversely affected by age.

In order to check this hypothesis the NNMT scores were correlated with age 
yielding coefficients of -.54 and -.45 respectively, and of -.48 when the two groups
were combined. Partial coefficients were then obtained for the combined groups. 
With age held constant the correlation between CVS and NNMT was .49. With 
CVS held constant the correlation between age and the NNMT was -.51. With the 
NNMT partialled out, the correlation between CVS and age was .19. It is obvious 
that the NNMT is adversely influenced by aging. With the older groups there is a 
distinct lowering of the mean NNMT score, the correlation with CVS is disturbed, 
and a high negative correlation with age appears which is not typical of CVS in the 
same group. The fact that the NNMT seems unduly sensitive to age encouraged us 
since it might indicate that, as we had hoped, we had succeeded in constructing
a test which was peculiarly sensitive to organic deficit. 
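
The partial coefficients reported above follow from the usual first-order formula, sketched here. The function name is hypothetical. Of the three inputs in the example only the age-NNMT correlation of -.48 is taken from the text; the other two zero-order values are assumed for illustration, so the result is not expected to reproduce the reported .49.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_age_nnmt = -0.48   # combined older groups, from the text
r_cvs_nnmt = 0.40    # assumed for illustration
r_age_cvs = 0.00     # assumed for illustration

# CVS-NNMT correlation with age held constant (x = CVS, y = NNMT, z = age).
r_cvs_nnmt_age_constant = partial_r(r_cvs_nnmt, r_age_cvs, r_age_nnmt)
```

When age is nearly unrelated to CVS but strongly related to the NNMT, removing age raises the CVS-NNMT correlation, which is the direction of the change reported.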

We next administered the NNMT to some 190 clinical cases: 40 schizophrenics,
50 paretics, and 100 mental defectives. The data are reported in Table 6. The
schizophrenics ranged in age from 22 to 43 with a mean of 30.76 and standard deviation
of 7.89. While their age is comparable to our previously mentioned female
group, the mean scores on NNMT and CVS are much lower, as would be expected,
namely 4.32 and 26.33 respectively. The correlation between NNMT and CVS,
however, is .65 and that between age and NNMT -.09. While severe deficit is present,
it seems to be of a different type than that related to age since the CVS-NNMT
relationship is not being upset by it and the previously found negative relationship
between age and NNMT does not appear. Differential effects of some kind are showing
up here which may have both diagnostic and theoretical importance.

TABLE 6. NNMT DATA FOR THE CLINICAL GROUPS

                   Schizophrenics    Paretics    Mental Defectives
                       (N=40)         (N=50)         (N=100)
Age      Range         22-43          33-55          16-30
         Mean          30.76          48.38          20.46
         SD             7.89           5.92           3.10
CVS      Range          6-47           4-40           4-22
         Mean          26.33          17.36          12.24
         SD            10.38           7.36           4.74
NNMT     Range          0-13           0-5            0-8
         Mean           4.32           2.22           2.90
         SD             3.18           1.39           1.83
Correlations
  CVS-NNMT              .65            .01            .12
  Age-NNMT             -.09            .02            .07
  Age-CVS               .09            .04            .02

The 50 paretics* had an age range of from 33 to 55 with a mean of 48.38 and 
standard deviation of 5.92, roughly comparable to the 65 firemen. Their mean CVS 
score was 17.36 with a range from 4 to 40. The mean NNMT score was 2.22 with a 
range of 0 to 5. Chance expectancy on the NNMT would be 3, since there are 15
items, each of which has 5 alternative answers. The paretics seem unable to “handle”
our matrix situation. As would be expected under these circumstances, the correlations
between CVS and NNMT, and NNMT and age are negligible, .01 and .02 respectively.
Again we find deficit on both CVS and the NNMT with the NNMT
particularly sensitive. 
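
The chance figure can be checked directly. The response card described earlier carries five alternatives (the correct segment and four types of error), so blind guessing gives one chance in five on each of the 15 items. The standard deviation shown is the ordinary binomial value and is added here only for orientation; it is not reported in the study.

```python
N_ITEMS = 15
N_ALTERNATIVES = 5   # one correct segment plus four error types

p = 1 / N_ALTERNATIVES
expected = N_ITEMS * p                     # mean score from blind guessing
sd = (N_ITEMS * p * (1 - p)) ** 0.5        # binomial standard deviation
```

A group mean near 3, such as the 2.22 of the paretics, is therefore no better than guessing.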

The mental defectives ranged in age from 16 to 30 with a mean of 20.46 and a 
standard deviation of 3.10. The CVS mean was 12.24 and the NNMT mean 2.90. 
Again we find the NNMT particularly low, the mean of 2.90 being about chance 
expectancy. Once more the correlations between CVS and NNMT (.12) and NNMT 
and age (.07) are negligible. The NNMT seems suitable for differentiating mental
defectives from normals as only 9 per cent of the latter group have scores as low as 
the mental defectives. 

SCATTER ANALYSIS

In our previous work we have had some success in using scatter measures based 
upon the difference between vocabulary scores and scores on other tests. We also 
tried such a measure in this study. Since we have comparable standard scores work- 
ed out by Wechsler’s technique on both V and NNMT, we defined scatter as the ap- 
pearance in any case tested of an NNMT score one standard deviation below the V 
score. Table 7 gives the results. In confirmation of our previous findings with other 


TABLE 7. SCATTER DATA FOR ALL GROUPS

                      N     % Scatter
Young Normals        445       14
Older Normals        104       67
Defectives           100        4
Schizophrenics        40       80
Paretics              50       50

tests: ©) we find differences between our groups, with the younger normals showing 
14 per cent scatter, the schizophrenics 80 per cent and the paretics 50 per cent. We 
should note that as in our previous studies “: 7: ®) our mental defectives, selected as a 
group of simple familial cases with no organic or emotional involvement present, 
show the least amount of scatter of all groups. When used with a vocabulary test 
NNMT thus seems to offer a satisfactory scatter measure for indicating possible psy- 
chological deficit in test performance. 
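
The scatter criterion lends itself to a brief sketch. Two points are assumptions: the standard deviation of the standard-score scale is taken as 3, and “one standard deviation below” is read as a difference of at least that size. The cases are invented pairs of (V, NNMT) standard scores.

```python
STANDARD_SD = 3.0   # assumed SD of the standard-score scale


def shows_scatter(v_standard, nnmt_standard, sd=STANDARD_SD):
    """True when the NNMT standard score is at least one SD below the V score."""
    return (v_standard - nnmt_standard) >= sd


def percent_scatter(cases):
    """Per cent of (V, NNMT) standard-score pairs that show scatter."""
    flagged = sum(shows_scatter(v, n) for v, n in cases)
    return 100 * flagged / len(cases)


# Four hypothetical cases; the first and last meet the criterion.
cases = [(12, 8), (10, 10), (9, 7), (11, 8)]
group_scatter = percent_scatter(cases)
```

Applied to each group in turn, a count of this kind yields the percentages of Table 7.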

Our results show some interesting differences on NNMT appearing between 
our normal and clinical groups. To investigate these further, we analyzed the differ- 
ent types of error appearing on the wrong responses. It will be remembered that each 
response card has four types of erroneous alternatives on it in addition to the correct 
answer: (1) the right symbol in the wrong place, (2) the wrong symbol in the right 


*Our thanks are due Mr. Joseph Matarazzo who contributed these data to our study.

place, (3) a repetition of one of the previous stimulus cards, and (4) an irrelevant
symbol. Table 8 gives the per cent of total errors for each category for each group 


TABLE 8. ERROR DISTRIBUTION FOR ALL GROUPS

                           % Errors        % Errors
                           Right Symbol    Wrong Symbol    % Errors      % Errors
Group               N      Wrong Place     Right Place     Repetitive    Irrelevant

Recruits           462         31              32              23            14
Older Normals      104         29              32              2?            16
Schizophrenics      40         20              24              26            30
Paretics            50         20              19              24            37
Defectives         100         19              18              24            38

Chi-Squares for error distribution
Recruits-Older Normals                             4      not sig.
Total Normals-Schizophrenics                       ?      1%
Defectives-Paretics                               .78     not sig.
Total Normals-Defectives and Paretics              ?      1%
Schizophrenics-Defectives and Paretics             ?      5%

? = entry illegible in the source copy.

and the chi squares for the differences in error distribution between groups. Some 
trends are immediately apparent. Both normal groups made their greatest number 
of error responses in the first two categories with fewer in the third and even fewer 
in the fourth. The schizophrenics made an approximately equal number of responses 
in each category. The paretics made an approximately equal number of responses 
in the first three categories and more in the fourth. The defectives made an equal 
number in the first and second, more in the third, and many more in the fourth. 
Comparing the groups we find both normal groups having more errors in the first two 
categories than the clinical groups (though not significantly more than the
schizophrenics), the same amount in the third, and many fewer in the fourth. By the chi
square test the two normal groups were not significantly different from each other
with respect to error distribution, but were different from the other groups beyond 
the one per cent level of confidence. The schizophrenics were different from the par- 
etics and defectives at the five per cent level of confidence and the paretics and de- 
fectives were not different from each other. 
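
The chi-square comparison of two error distributions can be sketched as follows. This is only an illustration under stated assumptions: the paper reports percentages rather than raw error counts, so the counts below are hypothetical, and the function name is ours.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square for an r x c table of counts (a list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if both groups shared one error distribution.
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof

# Hypothetical error counts in categories 1 to 4 for two groups.
normals = [310, 320, 230, 140]
schizophrenics = [40, 48, 52, 60]
chi2, dof = chi_square_homogeneity([normals, schizophrenics])
print(f"chi-square = {chi2:.2f} on {dof} degrees of freedom")
```

With four error categories and two groups there are three degrees of freedom, for which chi-square must exceed 7.81 to reach the five per cent level and 11.34 to reach the one per cent level.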

In order to make sure that the greater number of total errors made by the clin- 
ical cases was not favoring the differences, we matched as many as possible of the 
clinical cases with normals of the same approximate age with respect to NNMT 
scores, thus equating the total number of errors. We also matched the older normals 
with the recruits. The mean number of responses in each error category was then 
obtained for each sample and the significance of the differences between each pair of 
means was determined. The results substantiated our previous findings. The schizo- 
phrenics differed significantly from the normal controls with respect to mean number 
of responses in category No. 4. They had fewer responses in categories No. 1 and No. 
2, though these differences were not significant. The paretics differed from the nor- 
mals in having fewer responses in category No. 2 and more in No. 4. The defectives 
differed from the normals in all but category No. 3, having fewer responses in
categories No. 1 and No. 2, and more in No. 4. The two normal samples differed from
one another in no category. 

Thus there appears a tendency for the normal groups to make more errors in 
the first two categories while the clinical groups pile up on the fourth. One
explanation of these findings could be that the first two categories are “better” answers.
They have an element of “rightness”. In one case it is the form of the design
segment and in the other case it is the spatial position of the segment. The person
choosing one of these is half-right and has, if the choice were made purposively and 
not by guessing, demonstrated some discriminative ability. Category No. 4, on the 
other hand, has no correct element. It is definitely and obviously not similar to the 
others, therefore it is immediately rejected by those capable of comprehending the 
task. Category No. 3 is not so obviously incorrect, but neither does it usually have any
element of rightness and therefore it is not so frequently chosen. The clinical groups, 
on the other hand, are unable to make a discrimination between the first two categor- 
ies or even the first three since they all fit into the total pattern created by the stim- 
ulus cards to some extent. The task is too difficult for the clinical groups and they 
may respond to the frustration by going out of the field and choosing the one response 
that is completely different. Another possible explanation is that some primitive 
figure-ground effect is working for the clinical cases and emphasizing the irrelevant 
response, thus attracting the clinical subjects who cannot comprehend the task. In 
any case, differences are appearing that indicate the diagnostic possibilities of our 
multiple choice technique. 


SUMMARY 


A brief non-verbal intelligence test was designed to satisfy two criteria—good 
correlation with existing standard verbal tests and diagnostic potentiality either 
when used alone or in combination with other brief tests. When given to two samples 
of recruits totaling 445 subjects, the test showed good correlations with CVS and 
GCT alone, and increased the correlation of CVS with GCT when combined with the 
former. The results from the two samples were substantially the same. When given 
to a number of clinical groups the test was able to differentiate between these and 
the normals on the basis of total score and types of errors made. When combined 
with a brief vocabulary test it was an effective indicator of deficit as measured by 
test scatter. Although these results are frankly preliminary, we believe them to be 
interesting and suggestive, and feel that the test merits further study. 
A STUDY OF MALINGERING ON THE CVS ABBREVIATED INDIVIDUAL
INTELLIGENCE SCALE* 


PENELOPE P. POLLACZEK 


Northwestern University 


INTRODUCTION 


While the subject of malingering has always been an interesting one, particular- 
ly in the military services, relatively little work has been done on the subject of 
malingering on tests of intelligence. During the last war the problem of malingering 
on such tests was examined with enough success to warrant its further investigation. 
Hunt and Older studied the performance of malingerers on three brief measures
of intelligence: Arithmetical Reasoning (10 arithmetic problems), Easy Directions 
(20 simple tasks to be carried out), and the 1941 revision of the Kent Emergency 
Test. Goldstein designed a key for malingering to be used with the Army Visual
Classification test. In both of these studies investigation was carried out upon ex- 
perimental malingerers, i.e., individuals who were asked to simulate a mentally de- 
fective condition but who were not true malingerers in the sense that their decision 
to malinger was self-initiated. These studies showed that malingerers could be dis- 
tinguished from mental defectives on the particular tests used. In general there 
seemed to be a distinguishing tendency for the mental defectives to pass easy items 
and fail hard ones while the malingerers failed some easy items and passed some 
hard ones. 

The importance of detecting malingering, however, is not limited to the mil- 
itary situation but is evident as well in court clinics and prison situations where the 
falsification of test performance may have vital consequences for the delinquent
individual who is being tested. The problem also may arise in ordinary clinical
practice. It therefore was deemed important by the author to find whether or not 
the results of Hunt and Older and of Goldstein were specific to the tests they
employed or could be generalized to other test materials. The objectives of the present study
are to see whether or not the responses of malingerers on a brief intelligence battery 
can be distinguished from those of true mental defectives and, if so, whether or not
a scoring key can be constructed to use in the detection of such malingering. 


PROCEDURE 


The test used in this study was the CVS abbreviated individual intelligence 
scale as developed by Hunt and his coworkers. The test consists of the
Comprehension and Similarities subtests of the Wechsler-Bellevue test along with a
Vocabulary test which is made up of items selected from the Stanford-Binet Vocabulary
test. The sum of the scores on the subtests constitutes the overall score on the
CVS. 

The subjects used in this experiment were 50 male Naval recruits, 50 male col- 
lege students, and 50 male mental defectives as controls. The age range for the col- 
lege group was from 18 to 29, for the Navy group from 17 to 21, and for the mental 
defectives from 17 to 30. For the college group the range of intelligence in terms of 
IQ (as measured by the Otis test) was from 106 to 144. For the 31 Naval recruits on 


*This study is one part of a broader study of malingering submitted in partial fulfillment of the 
requirements for the Ph.D. degree at Northwestern University. It was done under the sponsorship of 
the Office of Naval Research as part of a larger project under the direction of Prof. William A. Hunt. 
The opinions expressed, however, are those of the author and do not represent the opinions or policy 
of the Naval service. Thanks are due Dr. Hunt for his direction of the study and to Dr. Janet A. Tay- 
lor and Dr. George Collier for advice on the statistical aspects of the work. Gratitude should also be 
expressed to the staffs of the Lincoln State School and Colony and the Great Lakes Naval Training
Center for their cooperation in providing subjects. 







whom a measure of general intelligence was available, the standard scores on the 
Navy General Classification Test ranged from 25 to 69. (The Navy mean is 50). 
All of the mental defectives had been classified as being at the high grade moron 
level according to previous tests of intelligence. They were all inmates of the Lincoln 
State School and Colony, an institution for mental defectives at Lincoln, Illinois. 
It was felt that emotional condition and organic pathology might be important var- 
iables to control in this study and so only mental defectives of the familial type (on 
the basis of unanimous staff diagnosis) and without deficiencies due to gross organic 
pathology or emotional difficulties were used. None of the college group was under 
therapy or treatment, nor were there any obvious signs of maladjustment. Since the 
Naval recruits had been through a medical and psychological screening process, it 
was assumed that there were no gross maladjustments present among them.

The instructions given to the mental defectives were the standard instructions
for each particular test. In addition to the usual instructions the two experimental 
groups of malingerers were read the following instructions which were designed to 
explain as clearly as possible the experimental task at hand: 


“You have been chosen to take part in an experiment. The results of this experiment
will give us the answers to some important questions and we are counting very highly on you
to help us find the answers.


“We want to find out what people do when they try to fake a psychological test. You
are going to be given some tests which measure intelligence; but your job in this experiment
will be to try to appear stupid. We already have a measure of your intelligence and know that
you certainly do not fall in the feebleminded class; but we are interested in what people would
do if they tried to play dumb on this test. You are to imagine that you are trying to get out of
the service by appearing to be feebleminded. Play dumb and try to do what you think a
feebleminded person would do.


“Let us take an example. Joe Smith is a boy of draft age who doesn’t want to serve in
the military forces. He decides that the thing for him to do is to fake feeblemindedness on
the intelligence test. Joe figures that by playing dumb on the intelligence test he is less apt
to be caught than if he were to try some other method of avoiding service. What you are to
do in this experiment is to pretend that you are Joe Smith and when you are given the
psychological tests try to fail them in such a way that you will make a score like that of a
feebleminded person. Are there any questions? Remember from now on you are to pretend that
you are Joe Smith and answer the questions the way you think a feebleminded person
would.”


The subjects in all of the groups seemed to be highly motivated. The inmates 
of Lincoln are sensitive to the importance of intelligence tests in determining their 
destiny. The college students were volunteers and participated in the experiment 
with a considerable amount of interest. The Naval recruits were in a situation under 
command in which cooperation can be assumed. 

Before turning to a consideration of the results of the present study an explana- 
tory word is necessary regarding the use of experimental malingerers. As it is almost 
never possible to obtain true malingerers for experimental purposes, most experi- 
menters have substituted experimental malingerers. The question arises as to 
whether or not such a procedure is justified. In answer to this question one can point 
first to the reports of Hunt and Older and Goldstein who observed empirically
that it was possible to detect true malingerers on the basis of test differences. The 
experimental group and the genuine malingerers were reported to give the same test 
picture. Apart from this empirical evidence one might offer the logical argument 
that true malingerers and experimental malingerers are adopting the same “set”,
namely to simulate some condition different from that which is truly representative 
of themselves. There is little reason to suspect that any other personality or
intellectual factors are operating in any important fashion to distinguish the behavior
of the two types of malingerers. As pointed out by Goldstein, “Although the
motivation of experimental and true malingerers is different, it is difficult to see how 
this can alter their mode of performance on the examination. In both instances, the 
objective is to feign feeblemindedness and to fail in such a fashion as to deceive.”







RESULTS 


The first point of interest in comparing the experimental and control groups was 
to determine whether or not they differed significantly in mean scores on the CVS 
test. The results are presented in Table 1. As can be readily seen from the table, 





TABLE 1. MEAN CVS SCORES FOR ALL GROUPS

Group         M         σ

Lincoln     11.40     4.955
College     11.92     6.051
Navy        11.64     6.657








there are no significant differences between the means of the different groups. When 
we compare the standard deviations of the experimental groups with that of the 
control group, we find that a difference exists between the standard deviations of 
the Navy and Lincoln groups which reaches the 5% level of confidence. The
difference between the standard deviations of the college and Lincoln groups reaches the
8% level of confidence. A difference between the standard deviations of the control
and experimental groups might be expected since the experimental malingerers repre- 
sent a more heterogeneous group than the truly feebleminded group. The experi- 
mental malingerers adopt widely different sets when told to malinger and show great 
dissimilarity in their ability to malinger successfully. 
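
The absence of a significant difference between means can be checked from the figures in Table 1 alone. The following is a minimal sketch, assuming 50 subjects per group (as stated under Procedure) and an unpooled standard error; the paper does not say which formula was used, so the function below is our illustration rather than the author's computation.

```python
import math

def t_for_mean_difference(m1, sd1, n1, m2, sd2, n2):
    """Critical ratio: difference of means over its unpooled standard error."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (m1 - m2) / se

# (mean, standard deviation, N) taken from Table 1; N = 50 is from the text.
lincoln = (11.40, 4.955, 50)
college = (11.92, 6.051, 50)
navy = (11.64, 6.657, 50)

t_college = t_for_mean_difference(*college, *lincoln)
t_navy = t_for_mean_difference(*navy, *lincoln)
print(f"College vs. Lincoln: t = {t_college:.2f}")
print(f"Navy vs. Lincoln:    t = {t_navy:.2f}")
```

Both ratios come out well below 1 (about 0.47 and 0.20), far short of the value near 2 that the 5% level would require with samples of this size.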

It must be concluded then, from a comparison of mean differences between ex- 
perimental and control groups that malingering as measured by total overall score 
on the CVS cannot be detected. These results are somewhat in contradiction to 
those obtained earlier by Hunt and Older and by Goldstein. Using an
Arithmetical Reasoning test, an Easy Directions test, and Kent’s revised Emergency
test, Hunt and Older report that: “Although the malingerers are able to conceal
their true mental age on these tests, they do not succeed in getting down to the real
level of the feebleminded. They act ‘dumb’ but not ‘dumb’ enough.” In contrast 
to this our results would indicate that it is not possible to differentiate on the basis 
of mean scores in all cases and so some other method of detecting malingering must 
be used. 

The next step was to investigate whether or not there were any significant differ- 
ences on individual items between the experimental and control groups even though 
no differences appeared in the total score. If such differentiating items proved to be 
present, it should be possible to construct a key for the detection of malingering. At
this point it should be remembered that on Comprehension and Similarities the
correct answer may be scored either “1” or “2” depending upon the quality of the
response. We have treated such qualitatively different responses as if they were
separate items. For example, in answering the question: “In what way are a wagon and
bicycle alike”, the answers: “They are both vehicles” and “Children play with
them”, are scored “2” and “1” respectively and are hence treated as if they were 
separate items. 

Table 2 shows in proportions the number of correct responses given to each 
item by each group. It may be seen readily that there are some large differences 
between the Lincoln group and both the College and Navy groups. It may also be 
seen that with few exceptions the Navy and College groups are in close agreement. 

It was decided that where the difference between proportions on any item
between the control group of mental defectives and either one of the experimental
malingering groups was significant at the 5% level or better, these items could be used
in building a key for malingering. In testing for the significance of difference between 
two proportions it was necessary, because each of the sample N’s was below 100, to 
follow the procedure of basing the sampling error on the combined proportion of each 
of the two groups on any one item as McNemar suggests. McNemar also presents
a rule-of-thumb criterion “for ascertaining when it is unsafe to use the standard





TABLE 2. PROPORTION OF EACH GROUP PASSING EACH ITEM ON CVS
(Scores of 1 and 2 on Comprehension and Similarities each treated as separate items)

[Table body not recoverable from the source. Columns: Level; Comprehension
(Lincoln, College, Navy); Similarities (Lincoln, College, Navy); Vocabulary
(Lincoln, College, Navy).]



error of the difference between proportions . . .” Following the rules laid down by
McNemar, several of the differences between proportions did not meet the
rule-of-thumb criterion and were immediately eliminated from any further consideration.
There remained, nevertheless, a number of significant differences when the formula
was applied. The “t’s” of these differences are given in Table 3.
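
The test described above, in which the sampling error is based on the combined proportion of the two groups, can be sketched as follows. The item proportions are hypothetical and the function name is ours; McNemar's rule-of-thumb check for small expected frequencies is not reproduced here.

```python
import math

def proportion_difference_ratio(p1, n1, p2, n2):
    """Critical ratio for two independent proportions, pooled standard error."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)  # combined proportion passing the item
    q = 1.0 - p
    se = math.sqrt(p * q * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

# Hypothetical item: 80% of 50 defectives pass it, 50% of 50 malingerers do.
ratio = proportion_difference_ratio(0.80, 50, 0.50, 50)
print(f"critical ratio = {ratio:.2f}")
```

Referred to the normal curve, a ratio beyond 1.96 reaches the 5% level and one beyond 2.58 the 1% level (two-tailed), so an item such as this hypothetical one would qualify for a key.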


TABLE 3. SIGNIFICANCE OF DIFFERENCES BETWEEN MEANS ON KEY ITEMS

Columns: Item (Comprehension, Similarities, Vocabulary); t for difference between
College and Lincoln means; group making the higher score; t for difference between
Navy and Lincoln means; group making the higher score.

[Table body not recoverable from the source: the item numbers, most t values, and
the group entries are illegible or displaced.]

*5% level of confidence         L = Lincoln
**1% level of confidence        C = College
                                N = Navy





An examination of Table 3 indicates that there are fourteen items on the CVS 
that differentiate the College from the Lincoln group at the 1% level of confidence, 
while three items differentiate these two groups at the 5% level of confidence. In 
comparing the Lincoln and Navy groups there are nine items on the CVS discrim- 
inating significantly at the 1% level of confidence. Also, it is interesting to note that 
on eight items the Navy and College groups are in agreement in differing significantly 
in the same direction from the Lincoln group. With one exception, on all other items 
where one of the experimental groups differed significantly from the control group, 
the other experimental group showed a difference in the same direction and on several
items the difference was just short of the 5% level of confidence.

The fact that certain items on the CVS differentiated between the groups made 
the development of a key for detecting malingering a definite possibility. Six differ- 
ent keys for the CVS were then proposed, each one based on some particular prin- 
ciple. Key No. 1, for example, was composed of all the items in which the College 
group differed significantly from the Lincoln group at the 5% level of confidence or
better. Key No. 2 was similarly developed substituting the Navy group for the Col- 
lege group. Key No. 1 was composed of 17 items while 14 items made up Key No. 2. 
Key No. 3 was composed of all of the items included in Keys No. 1 and No. 2 mak- 
ing a total of 23 items. The 8 items which were common to Keys No. 1 and No. 2 
were combined into a separate key, No. 6. Key No. 4 was composed of those items
in Keys No. 1 and No. 2 which answered the criterion of the 1% level of confidence or
better, resulting in 17 items. Key No. 5 was composed of items which reached the
5% level of confidence or better on Keys No. 1 and No. 2 with the additional
criterion that whenever an item reached the 5% level of confidence or better for either
the Navy or the College group, it must also show a strong trend (though not
necessarily reaching the 5% level of confidence) where the other group is concerned. This
selection process resulted in 18 items. Each individual in all groups was scored on 
each key and distributions for each group were prepared. 
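
The way Keys No. 1 to 4 and No. 6 are built from the item-level tests amounts to simple set operations, sketched below. The item labels are hypothetical placeholders chosen only so that the set sizes agree with the counts reported in the text (17, 14, 23, 17, and 8 items); the overlap among the 1% items is therefore an assumption. Key No. 5 is omitted because its "strong trend" criterion is not defined numerically.

```python
# Hypothetical item labels, sized to match the counts given in the text.
college_5pct = {f"item{i:02d}" for i in range(1, 18)}   # 17 items, 5% or better
navy_5pct = {f"item{i:02d}" for i in range(10, 24)}     # 14 items, 5% or better
college_1pct = {f"item{i:02d}" for i in range(4, 18)}   # 14 of these reach 1%
navy_1pct = {f"item{i:02d}" for i in range(12, 21)}     # 9 of these reach 1%

key1 = college_5pct                 # College vs. Lincoln
key2 = navy_5pct                    # Navy vs. Lincoln
key3 = key1 | key2                  # every item in either key
key4 = college_1pct | navy_1pct     # 1% level or better for either group
key6 = key1 & key2                  # items common to both keys

for name, key in [("1", key1), ("2", key2), ("3", key3), ("4", key4), ("6", key6)]:
    print(f"Key No. {name}: {len(key)} items")
```

The reported counts are mutually consistent: 17 + 14 - 8 = 23 items for Key No. 3.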

In order to determine whether or not the keys were actually discriminating
between the experimental and control groups, the Mann-Whitney U test was
employed to test the hypothesis that the distributions of experimental and control
groups were significantly different.¹ All of the keys tested in this manner showed
significant differences considerably beyond the 1% level of confidence. It might be 
argued that Mann and Whitney’s technique is not altogether appropriate here since 
there are a large number of ties in the malingering scores between the experimental 
and control groups. On the other hand the fact that the obtained “U’s” were so
large, placing significance at considerably beyond the 1% level of confidence,
certainly suggests that definite differences exist between the distributions of the two
populations.
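
The U test with its normal approximation (mean mn/2, standard deviation the square root of mn(m+n+1)/12) may be sketched as below. The scores are hypothetical, ties are counted as one half, and no tie correction is applied to the standard deviation, which is the limitation the text itself raises.

```python
import math

def mann_whitney_z(xs, ys):
    """U for xs versus ys (ties count one half) and its normal deviate."""
    m, n = len(xs), len(ys)
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    mean_u = m * n / 2.0
    sd_u = math.sqrt(m * n * (m + n + 1) / 12.0)  # no correction for ties
    return u, (u - mean_u) / sd_u

# Hypothetical malingering-key scores for two small groups.
malingerers = [9, 10, 8, 11, 7, 12]
defectives = [5, 6, 4, 7, 3, 8]
u, z = mann_whitney_z(malingerers, defectives)
print(f"U = {u}, z = {z:.2f}")
```

For two groups of 50, as in this study, the same formulas give a mean of 1250 and a standard deviation of about 145.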

In order to determine how many malingerers would be detected successfully 
and how many false positives would be picked up among the mentally defective 
group through the use of any particular key, all of the keys were applied to each 
group using cut off points. On the basis of this check along with the previous statis- 
tical check, Key No. 4 (composed of those items on which either of the experimental 
groups differed from the control group at the 1% level of confidence or better) was 
selected as the most promising. The results of applying Key No. 4 to the experimental 
and control groups are given in Table 4. Inspection of this table shows that if 


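TABLE 4. NUMBER OF INDIVIDUALS IN EACH GROUP PICKED UP BY KEY NO. 4
(RESULTS GIVEN IN CUMULATIVE [PROPORTIONS])

Number of Items     Navy     College        Lincoln
       6            .98      [illegible]    .44
       7            .94      .96            .22
       8            .84      .90            .10

[Rows for scores of 3 to 5 and of 9 and above are only partly legible in the source,
and their entries cannot be assigned to columns with confidence. Legible values in
source order: score 3: 1.00; score 4: 1.00, .84; score 5: .98, .60; higher scores:
.64, .46, .20, .08, .04, .02. The College entry for a score of 8 is taken from the text.]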








Key No. 4 is used with a cutting point defined as a malingering score of eight or above,
84% of the Naval group and 90% of the College group will be picked up while 10% of the
mental defectives will be falsely identified. In terms of the usual standards for
screening tests this performance would be considered good and suggests the
potentiality of this key on the CVS for the detection of malingerers. Key No. 4 is presented in
Table 5.
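
The effect of moving the cutting point can be illustrated with a short sketch. The scores below are hypothetical, not the study's data, and the helper name is ours; the sketch only shows how the proportion detected and the proportion of false positives are both read off at each candidate cut.

```python
def proportion_at_or_above(scores, cut):
    """Proportion of a group whose malingering score reaches the cut."""
    return sum(1 for s in scores if s >= cut) / len(scores)

# Hypothetical malingering scores, not the study's data.
malingerers = [12, 9, 8, 10, 7, 11, 8, 9, 6, 13]
defectives = [4, 5, 3, 6, 8, 5, 4, 7, 3, 5]

for cut in range(6, 10):
    detected = proportion_at_or_above(malingerers, cut)
    false_pos = proportion_at_or_above(defectives, cut)
    print(f"cut {cut}: detected {detected:.2f}, false positives {false_pos:.2f}")
```

Raising the cut lowers both figures together, so the choice of eight reflects a compromise between missed malingerers and falsely identified defectives.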


¹This is a non-parametric test to determine whether two distributions of scores are statistically
different. In this test no assumption is made concerning the nature of the score distribution or the
equality of variance. With samples of this size, the statistic U is normally distributed with mean
$M = mn/2$ and standard deviation $\sigma = \sqrt{mn(m+n+1)/12}$. The probabilities reported were
obtained from a normal probability table.







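TABLE 5. KEY NO. 4 FOR DETECTING MALINGERING ON CVS

Columns: Item No.; Response (given twice, side by side).

[Table body only partly legible in the source. Left-hand column (Comprehension, then
Similarities): item 1,1 keyed 0, followed by further items keyed 0 whose numbers are
illegible. Right-hand column (Vocabulary): items 1, 2, 4, 6, 8, and 9; keyed
responses illegible.]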








On those items marked zero the feebleminded group tended to pass the item while the malingerers 
tended to fail it. Those marked one indicate a tendency to fail the item on the part of the feebleminded 
and a tendency of the malingerers to pass it. Therefore a failure on the items marked zero and a suc- 
cess on those marked one each contributes one point to the malingering score. 
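
The scoring rule just stated can be written out as a short sketch. The item labels and the three-item key are hypothetical (Key No. 4 itself has 17 items), and the treatment of unadministered items is our assumption.

```python
def malingering_score(responses, key):
    """Count responses that fall in the keyed (malingering) direction.

    responses: item label -> True if passed, False if failed
    key:       item label -> 0 (malingerers tend to fail the item)
                             or 1 (malingerers tend to pass it)
    """
    score = 0
    for item, keyed in key.items():
        passed = responses.get(item)
        if passed is None:
            continue  # assumption: an item not given contributes nothing
        if (keyed == 0 and not passed) or (keyed == 1 and passed):
            score += 1
    return score

# Hypothetical three-item key and one subject's record.
key = {"C1": 0, "V8": 1, "S3": 0}
record = {"C1": False, "V8": True, "S3": True}
print(malingering_score(record, key))
```

Here the subject fails a zero-keyed item and passes a one-keyed item, earning two points; passing the other zero-keyed item earns none.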


It should be pointed out that while the use of two experimental groups gives some 
evidence of cross-validation, further cross-validation of all of these keys would be 
necessary before they would be ready for use. In view of the promising results of this 
study, however, it seems safe to conclude that malingering on the CVS abbreviated 
individual intelligence scale can be detected and successful keys can be developed 
for this purpose. 


SUMMARY 


A study was made of the possibility of detecting malingering on the CVS ab- 
breviated individual intelligence scale. Three groups were used: 50 college males 
and 50 Naval recruits as experimental groups, and 50 male mental defectives for 
control purposes. The experimental groups were requested to simulate mental de- 
ficiency on the CVS test while the genuine mental defectives were given the test in 
the usual manner. Comparison of the malingering groups with the mentally de- 
fective group indicated that although it was impossible to detect malingering using 
total test score, enough significant differences between the experimental groups and 
the control groups appear on individual items to make a key for malingering on the 
CVS a distinct possibility. Such a key was prepared and applied to all the groups 
used in the study. Roughly 90% of the malingerers were detected and only 10% of
the mental defectives were falsely identified as malingerers. This is considered to be 
sufficiently good differentiation to suggest the value of the key for use in screening 
malingerers in a practical testing situation. Further cross-validation is necessary. 
INTRODUCTION


Recently there has been increasing interest in the use of intra-individual test 
discrepancies as a measure of functioning efficiency. One of the major problems in 
establishing scatter patterns has been that of finding a suitable internal reference 
point. The three reference points most commonly employed are the mean or IQ, the 
vocabulary score, and the altitude or maximum score. Psychologists who believe 
that the G factor is an ability argue for the use of the IQ as a reference point while 
those who believe that it is a capacity use altitude as a reference point. In regarding 
intelligence as a capacity or potentiality rather than an ability, vocabulary and alti- 
tude proponents consider that test scores falling significantly below the potentiality 
indicate mental inefficiency or personality disorganization.

Wechsler, Magaret, and Rapaport have utilized means as reference
points in studying scatter on the Wechsler-Bellevue. For Wechsler and Magaret 
this mean is the average of all of the individual’s sub-tests while Rapaport utilized 
deviations from a modified verbal mean and from the performance mean. All of these 
serve as fairly stable reference points because they are average scores. Jastak
points out, however, that the averaging of such heterogeneous measures results in a
psychologically ambiguous score. 

Vocabulary test scores have been used as reference points by Babcock and 
Rapaport. Use of vocabulary has been claimed advantageous in that it is
relatively unaffected by mental disorganization. The major disadvantage in its use is that
a verbal test underestimates the intelligence level of persons who do better on per- 
formance tests than on verbal. 

Jastak introduced the concept of altitude as a reference point for scatter an- 
alysis. He believes that clinical experience has shown the top score rather than the 
mean score to be most closely related to the individual’s native endowment. He states 
that persons of ‘‘normal’’ personality structure tend to adjust to life at a level com- 
mensurate with the highest score. Wide deviations occurring between the capacity 
and the functioning ability (for example, considerable discrepancy between altitude 
and other subtest scores on the Wechsler) indicate maladjustment. 

Whiteman further studied the feasibility of using altitude as a reference
point in the derivation of scatter patterns. He found significant differences in the 
subtest deviations from altitude between a group of schizophrenics and a group of 
nurse applicants. 

The hypothesis that anxiety produces disturbance in test performance has been 
tested in various ways. Rashkis and Welsh made up a list of signs which appeared
to differentiate, in terms of performance on the Wechsler, between those cases in
which anxiety was judged to be a prominent feature and those in which anxiety was
not declared essentially contributory. These signs consisted of “temporary
inefficiency” on any of the Wechsler subtests. Shoben gave Wechslers to a group
of thirty-five college men. Eighteen of these men had standard scores of sixty-five 
or above on the neurotic triad of the Minnesota Multiphasic Personality Inventory 


*Reviewed by the Veterans Administration and published with the approval of the Chief Medical 
Director. The statements and conclusions published by the authors are the result of their own study 
and do not necessarily reflect the opinion or policy of the Veterans Administration. 

The authors wish to express their appreciation to Drs. M. R. Jones and I. Simos for their sug- 
gestions and criticisms of this paper. 
(hereafter MMPI). These men made up the “anxious” group. The remaining
seventeen subjects were judged to be “nonanxious.” No significant differences were found
between the groups with regard to the Rashkis-Welsh anxiety signs. The present 
authors question the use of the neurotic triad as an indicator of anxiety since one of 
the scales, the Hy scale, does not seem particularly relevant to the detection of 
anxiety. 

Rapaport suggested two Wechsler-Bellevue signs which would distinguish
between anxious and non-anxious patients: (a) a Digit Span score much below the
Vocabulary level and/or the mean Verbal level is mainly indicative of the presence
of anxiety, (b) “Impaired efficiency on the Object Assembly subtest may be a
reflection of depressive or anxiety trends or both.” Gilhooly attempted to validate
these signs with a group of fifty-two psychoneurotics in whom anxiety was a primary 
feature and another group of neurotics in whom anxiety was not important. Neither 
of these hypotheses was substantiated. 

Warner attempted to determine (a) whether anxiety neurotics differ
significantly from normals with respect to intertest variability, (b) whether anxiety
neurotics differ significantly from normals with respect to the difference between
verbal and performance subtest scores, and (c) whether there are any subtest
patterns that allow differentiation between groups of anxiety neurotics and groups of
normals. He found significant differences between the psychometric pattern of a
group of normals and the anxiety neurotic group. His findings indicated that
anxiety neurotics do better on concrete tasks, in comparison with more abstract ones,
than do normals. He was unable to find a significant difference between normals and
anxiety neurotics with regard either to intertest variability or to difference between
verbal and performance subtest scores.

In view of the previous lack of success it was decided to try a new approach 
which would utilize a more global measure of scatter. For purposes of this study,
the authors are explicitly accepting the assumption that altitude is an index of
intellectual potential. The central hypothesis of this study is that there is a positive
rectilinear relationship between altitude-IQ discrepancy scores and degree of anxiety 
as indicated by the MMPI. A second hypothesis is that there will be a positive 
rectilinear relationship between altitude-IQ discrepancy scores and overall degree of 
personality disturbance as measured by the MMPI. A third hypothesis is that a 
technique for calculating discrepancy scores which takes into account possible 
“natural” differences between verbal and performance capacities will yield a sig- 
nificantly higher correlation than one which does not. The latter hypothesis is based 
upon a factor analysis of the Wechsler-Bellevue(?) which demonstrated that the two 
factors appearing most consistently throughout a wide age range were a verbal and 
a performance factor. 

PROCEDURE 

The subjects used were 82 patients seen at the Veterans Administration Hos- 
pital in Lincoln, Nebraska, the Psychological Clinic at the University of Nebraska, 
and the Veterans Administration Mental Hygiene Clinic in Omaha, Nebraska, with- 
in a two year period (January 1949 to January 1951). All of the subjects had been 
administered both the Wechsler-Bellevue test and the MMPI within a two week 
period. The subjects were unselected as far as diagnostic classification is concerned. 
However the population from which they were drawn consisted largely of diagnosed 
psychoneurotics. Those who showed evidence of organic brain damage were not 
included as the study is concerned only with the “functional” disorders. Several 
subjects were also eliminated on the basis of invalid MMPI profiles as determined 
by the F-K criterion where the difference is larger than ten. 

Discrepancy scores on the Wechsler-Bellevue were calculated by the following 
procedures: (a) Using a method similar to that of Whiteman “*) the highest subtest 
weighted score was multiplied by five, the second highest by three, and the third 
highest by two to provide the equivalent of ten subtest scores and to weight the 
scores at the ceiling of the individual’s achievement. The quotients were then found 
in the usual manner in Wechsler’s Full Scale tables. The subject’s Full Scale IQ was 
then subtracted from this altitude quotient yielding a discrepancy score. (b) A sec- 
ond discrepancy score was then computed by using a combination of top verbal and 
top performance scores. This was done to rule out the possible effect of any “natural” 
differences in the verbal and performance capacities of the subjects. In this second 
method, the two top verbal scores and the two top performance scores were added 
together and multiplied by 2.5 so that their sum would be comparable to the total 
weighted score of the Wechsler-Bellevue scale. Quotients were also found for these 
scores in the usual manner and again the Full Scale IQ was subtracted from this 
altitude quotient to yield a discrepancy score. 
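
The weighting arithmetic of the two methods can be sketched as follows. This is an illustrative sketch only: the function names and the sample subtest scores are hypothetical, and the conversion of a weighted total to a quotient by Wechsler's Full Scale tables is not reproduced here, so the quotient passed to `discrepancy` is assumed to have been looked up separately.

```python
def altitude_total_method_a(subtest_scores):
    """Method (a): weight the three highest subtest scores by 5, 3 and 2.

    The weights sum to ten, so the result stands in for a ten-subtest total.
    """
    if len(subtest_scores) < 3:
        raise ValueError("need at least three subtest weighted scores")
    top = sorted(subtest_scores, reverse=True)[:3]
    return 5 * top[0] + 3 * top[1] + 2 * top[2]


def altitude_total_method_b(verbal_scores, performance_scores):
    """Method (b): (two top verbal + two top performance scores) times 2.5."""
    if len(verbal_scores) < 2 or len(performance_scores) < 2:
        raise ValueError("need at least two verbal and two performance scores")
    top_verbal = sorted(verbal_scores, reverse=True)[:2]
    top_performance = sorted(performance_scores, reverse=True)[:2]
    return 2.5 * (sum(top_verbal) + sum(top_performance))


def discrepancy(altitude_quotient, full_scale_iq):
    """Discrepancy score: altitude quotient minus the obtained Full Scale IQ."""
    return altitude_quotient - full_scale_iq


# Hypothetical weighted scores for one subject (not data from the study).
verbal = [12, 10, 9, 11, 8]
performance = [13, 9, 10, 7, 8]

total_a = altitude_total_method_a(verbal + performance)  # 5*13 + 3*12 + 2*11
total_b = altitude_total_method_b(verbal, performance)   # 2.5 * (12+11+13+10)
```

Method (b) draws equally on the verbal and performance halves of the scale, which is why it cannot be dominated by a subject whose best scores all fall on one side.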

In order to obtain a somewhat objective measure of anxiety the MMPI scale 
measurements of Hypochondriasis, Depression and Psychasthenia were used. These 
three scales were chosen because, of the four so-called “neurotic” scales, these three 
are the most likely to measure the vague fears, overconcern and worry character- 
istic of an anxiety reaction. Arbitrary weights were assigned to the MMPI scale 
heights as follows: A height of 60 to 69 was given a weight of one, 70 to 79 a weight 
of two, 80 and above a weight of three. Treating each scale as equal in importance 
gave us a range of nine points for our estimations of the degree of anxiety. This value 
was then plotted on a scatter diagram against the discrepancy scores. The relation- 
ship was found to be rectilinear and Pearson r’s between the anxiety scores and the 
discrepancy scores (a) and (b) were computed. 
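
The weighting scheme just described can be written out as a short sketch. The function names are hypothetical, and the treatment of scale heights below 60 as a weight of zero is an assumption consistent with the stated nine-point range rather than something the text states outright.

```python
def scale_weight(t_score):
    """Weight for one MMPI scale height: 60-69 -> 1, 70-79 -> 2, 80+ -> 3.

    Heights below 60 are assumed to contribute nothing.
    """
    if t_score >= 80:
        return 3
    if t_score >= 70:
        return 2
    if t_score >= 60:
        return 1
    return 0


def anxiety_score(hs, d, pt):
    """Sum of weights for Hypochondriasis, Depression and Psychasthenia."""
    return scale_weight(hs) + scale_weight(d) + scale_weight(pt)


def weighted_scale_sum(scale_heights):
    """Same weighting applied to any collection of scale heights."""
    return sum(scale_weight(t) for t in scale_heights)


# Hypothetical profile: Hs = 55, D = 72, Pt = 85 gives 0 + 2 + 3.
example = anxiety_score(55, 72, 85)
```

With three scales each contributing at most three points, the anxiety score cannot exceed nine.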

In order to obtain an overall estimate of “degree of disturbance” the weighted 
scale height scores for all of the MMPI scales, using the same weighting system, were 
summed and the resulting scores plotted. These scores were also found to be rectil- 
inearly related and again Pearson r’s were computed. The entire MMPI profile was 
used in this measurement in order to define degree of disturbance broadly enough 
to include all behavioral deviations measured by the MMPI under the assumption 
that those deviations are produced by an emotional disturbance of some type. 

RESULTS 

As can be seen in Table 1, of the two correlations between discrepancy scores 
and anxiety, one is significant at the 1% level of confidence and one at the 5% level. 
The correlations between discrepancy scores and all MMPI scales were somewhat 
lower, suggesting that anxiety was a primary variable in producing altitude-IQ 
discrepancies. To verify this the correlation between discrepancy scores and all 
MMPI scales excluding the anxiety trio was calculated. The test of significance of 
the difference between the correlation obtained using the anxiety scales and that 
obtained using the remaining six MMPI scales yielded a t of 2.57 (significant at 
the 2% level of confidence). 


TABLE 1. SHOWING CORRELATIONS AND THEIR SIGNIFICANCES BETWEEN DISCREPANCY SCORES 
(CALCULATED BY METHODS A AND B) AND VARIOUS MEASURES BASED ON MMPI SCORES. 

               Hs, D, Pt              All 9 MMPI           All scales except 
                Scales                  Scales                 Hs, D, Pt 
Method        r       Signif.        r       Signif.        r       Signif. 

  A      [illegible]    5%          .14        no          .01        no 
  B      [illegible]    1%          .22        5%          .08        no 


While the use of method (B) results in a slightly higher correlation than the use 
of (A), the difference does not reach the 5% level of confidence. 

DISCUSSION 


Although a correlation of .31 cannot be used for individual predictive purposes 
it suggests that there is a reliable relationship between our measure of anxiety and 
scatter as measured by the altitude method. Further work in this area may suggest 
appropriate cutting scores to indicate efficiency of functioning. 
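
As a check on the reliability claim, the significance of a correlation of .31 among 82 subjects can be worked through with the conventional t test for a Pearson r. The authors do not say which test they applied, so this is a sketch under that assumption; the critical values are the usual two-tailed tabled values for 80 degrees of freedom, and the names are hypothetical.

```python
import math


def t_for_r(r, n):
    """t statistic for testing a Pearson r against zero, with n - 2 df."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Two-tailed critical values of t for 80 degrees of freedom (standard table).
T_CRITICAL_DF80 = {0.05: 1.990, 0.01: 2.639}

t_value = t_for_r(0.31, 82)  # roughly 2.9
significant_at_01 = t_value > T_CRITICAL_DF80[0.01]
```

On these assumptions the obtained t exceeds the tabled value for the 1% level, in agreement with the significance reported for the higher of the two anxiety correlations.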

Some of the limitations of this study are as follows: First, the criterion measure 
of anxiety was imperfect. Scales derived as specific anxiety indicators may be better. 
These were unavailable to the experimenters. Second, the fact that the Wechsler 
tests were administered by different examiners left another variable uncontrolled. 
Third, no attempt was made to see whether the subjects falling in different 
anxiety categories were matched on other possibly significant variables such as age. 
Fourth, although the concept of altitude as a potential may be a valid one, the 
Wechsler-Bellevue test, because of the nature of its material, may be relatively in- 
sensitive to impairment of intellectual functioning as compared to some of the 
projective devices. Fifth, and this is particularly pertinent to the present study, 
the range of subjects used was quite narrow, consisting mainly of diagnosed neu- 
rotics. Therefore it is encouraging that in spite of this attenuating factor the method 
yielded a significant relationship. It seems apparent, in view of the foregoing dis- 
cussion, that for a less homogeneous population there may be an even more significant 
relationship than the correlation reported here suggests. 


CONCLUSIONS 

1. There is a significant rectilinear relationship between degree of anxiety and 
altitude-IQ discrepancy score. 

2. With regard to the second hypothesis, anxiety, rather than overall degree 
of disturbance, seems to be a primary variable in reducing functioning efficiency 
among this group of neurotics. 

3. When possible intraindividual differences in verbal and performance cap- 
acities are taken into account the correlation between discrepancy scores and anxiety 
is somewhat higher, although the difference between the two methods does not reach 
the 5% level of confidence. 


4. This study indirectly lends support to Jastak’s assumption that altitude 
score is a better indication of an individual’s intellectual potential than is the 
Wechsler IQ. 

5. It seems likely that the correlation reported above is a conservative estim- 
ate of the true relationship in view of the uncontrolled factors involved. The fact 
that it is a significant correlation is an encouraging sign for the new technique of 
exploring relationships between subtest scatter and anxiety. 
PURPOSE 

In the Army, soldiers who commit serious offenses are subject to a general court 
martial and may be sentenced to a disciplinary barracks, the mission of which is 
the rehabilitation of prisoners. In such a disciplinary barracks, the Psychiatric and 
Sociology Section processes all incoming prisoners and maintains continued contact 
with the prisoner for the purpose of diagnosis and treatment. Recommendations 
are also made by this section concerning the assignment to duty or type of training 
at the disciplinary barracks. The purpose of this study is to examine the relation be- 
tween psychiatric classification of Army general prisoners and scores on the MMPI. 


METHOD 
One of the tests frequently used by the psychologists in the Psychiatric and 
Sociology Section is the MMPI (Minnesota Multiphasic Personality Inventory). 
This test was chosen for the present study for the reason that it is the only widely 
used structured test which gives a measure of the psychopathic deviate. In the 
present study, MMPI’s were available for 40 prisoners who were diagnosed as hav- 
ing no neuropsychiatric disorder; 53 with diagnoses of emotional instability; and 43 
with diagnoses of anti-social personality. These diagnoses were made, of course, on 
the basis of one or more psychiatric interviews. 


RESULTS 
TABLE 1. MEAN SCORES ON THE MMPI FOR THREE GROUPS (DETERMINED BY PSYCHIATRIC 
DIAGNOSIS) AND A NORMATIVE GROUP BY SCHMIDT 

       No 
Neuropsychiatric     Emotional       Anti-Social      Schmidt’s 
    Disorder        Instability      Personality      “Normals” 
 Mean     S.D.     Mean     S.D.    Mean     S.D.    Mean     S.D. 

[Table body illegible in the source scan.] 

The mean scores on the MMPI obtained by the prisoners in these three psychi- 
atric classifications are shown in Table 1. Herein are presented the mean scores ob- 
tained on nine of the clinical sub-scales and two of the validating sub-scales, as well 
as mean scores obtained on a comparative group of ‘‘normal”’ soldiers presented in 
an article by Schmidt ©. Moreover, this material is presented graphically in Fig- 
ure 1, which gives the profiles representing the mean scores on these various sub- 
scales. An examination of Table 1 and Figure 1 indicates that the general prisoners 
of all three categories are considerably more deviant than Schmidt’s ‘‘normals”’. 
The order of these four groupings in terms of elevation of MMPI sub-scales follows: 
(1) Schmidt’s normal group; (2) no neuropsychiatric disorder group; (3) emotional 
instability group; (4) anti-social personality group, with this last group having the 
highest mean scores. The profiles in Figure 1 for the three groupings of general 
prisoners reveal a considerable similarity of profile in all types of maladjustment. 
The differences between the three prisoner groups are of degree rather than of kind 
in terms of the MMPI scores. Therefore, the “No NP disorder” terminology in this 
particular setting appears to be a relative matter; that is, the mild psychopath by 
comparison with his fellow inmates may appear “normal” and thus obtain a “No 
NP” diagnosis. 

FIGURE 1. PROFILES OF MEAN SCORES ON THE MMPI FOR THREE DIAGNOSTIC GROUPS 
AND A NORMATIVE GROUP BY SCHMIDT 

TABLE 2. T-VALUES FOR DIFFERENCES ON MMPI SCALES BETWEEN THREE PSYCHIATRIC 
GROUPINGS AND DIFFERENCES WHEN COMPARED WITH SCHMIDT’S NORMATIVE GROUP 

In Table 2 are presented the t-values of the differences between the means of 
these four groupings. In examining the differences between Schmidt’s normal group 
and the No NP group, one may note that the psychopathic deviate scale is the most 
discriminating with a t of 8.24. The t-values for D and Hy are also very significant, 
although Hs interestingly enough does not reveal a significant difference. One might 
say that the relative significance of the D score might be explained in terms of a 
reactive type of depression which may be accounted for by confinement per se; that 
is, although the elevated D might normally appear in conjunction with the elevated 
Pd scores, it might also appear by virtue of a reaction to a period of incarceration. 
The Ma t-value is quite distinct also. This correspondence with high scores on the 
Pd scale is in accordance with what the authors of the MMPI have stated concern- 
ing these two scales: 

“The hypomanic patient has usually gotten into trouble because of undertaking too many things. 
He is active and enthusiastic. Contrary to common expectations he may also be somewhat depressed 
at times. His activities may interfere with other people through his attempts to reform social practice, 
his enthusiastic stirring up of projects in which he then may lose interest, or his disregard of social 
conventions. In the latter connection he may get into trouble with the law. A fair percentage of 
patients diagnosed psychopathic personality (see Pd) are better called hypomanic.” 


In comparing Schmidt’s group of normals with the No NP group, one finds the 
least discriminating scales to be Hs and Pt. Apparently those prisoners who pre- 
sented no hypochondriacal or psychasthenic symptoms appeared “normal” in the 
prison setting. 

In comparing Schmidt’s normal group with the Emotional Instability group, 
one observes that all of the t-values are significant at the .01 level of confidence. 
The three highest t-values are found in D, Pd, and Ma. It is likely that Pd and Ma 
are the really significant personality divergencies. As was noted in the discussion 
of the “No NP” group, it is quite possible that there is considerable overlapping 
in these two diagnostic categories, Pd and Ma. Similarly, Hs is again relatively low. 

A comparison of Schmidt’s normals with the Anti-social Personality group re- 
veals that the Anti-social Personality group have significantly higher scores, as 
indicated by high t values. One might say that some of these prisoners are in- 
carcerated for anti-social tendencies and some for multitudinous activities, a few of 
which may lead to trouble. The hyperactivity of the hypomanic is sometimes of the 
degree that the individual will ultimately be in some type of conflict with his en- 
vironment sheerly by virtue of doing too many things, but it is most likely that this 
particular group most clearly fits the classification “psychopath”. 

In sum, it can be said in comparing the three prisoner groups with Schmidt’s 
normals: (1) the prisoners are significantly more deviant; (2) all three categories of 
the prisoners’ classifications are considerably alike in their deviancy in terms of 
MMPI scales; (3) the diagnosis, “no neuropsychiatric disorder”, is apparently a 
very relative term; (4) the most significant scales in terms of differentiation are 
Pd and Ma. 

In comparing the No NP group with the two groups having diagnoses of abnor- 
malities, one finds the only clinical scale which has t-values higher than 3 to be Pd. 
The t-value between No NP and Emotional-Instability is 3.11, and with Anti-social 
Personality, 5.69. The only other clinical scale at the .01 level of confidence for both 
comparisons is Sc. The significant feature of the differences in profiles here is that 
there are no statistically significant divergencies at all on the neurotic triad. Ap- 
parently psychoneurotic tendency is not one that enters into the differentiation. As 
a matter of fact, it is noted that the Hy and Hs scores are relatively low by 
comparison for all three groups. The psychiatric classification of “no neuropsychia- 
tric disorder” appears to be a function mainly of presence or lack of behavior atti- 
tudes implicit in the Pd scale and possibly in the Sc scale as well. 

In comparing the Emotional-Instability and the Anti-social Personality groups, 
one finds three scales revealing differences significant at the .05 level of confidence: 
Pd, Pa, and Ma. None of the differences is significant at the .01 level of confidence. 
Again, it is noted that the differentiation is apparently in terms of the degree of 
symptomatology. Those with the greatest amount of abnormality or deviation are 
classified in the Anti-social Personality group, the next in the hierarchy in the 
Emotional-Instability group, and those least deviated in the No Neuropsychiatric 
Disorder group. 
CONCLUSIONS 
1. Army general prisoners deviate significantly in all clinical scales of the 
MMPI regardless of their psychiatric classification when compared with a “normal” 
group of soldiers; that is, when compared with Schmidt’s normal group, they have 
more neurotic, psychopathic and psychotic trends. 


2. MMPI results suggest that there is a general personality pattern some- 
what typical of all general prisoners with elevation particularly in the variables of 
psychopathy and hypomania. 


3. The writer suggests that if some inventory such as the MMPI were used to 
screen recruits for the Army those with elevated Pd and/or Ma scales (65 or higher) 
should have (a) a thorough psychiatric case report and (b) a subsequent thorough 
psychiatric interview before they could be considered suitable for enlistment. 
STEEL MILL “HOT STRIP” ACCIDENTS AND INTERPERSONAL 
DESIRABILITY VALUES 
BORIS SPEROFF AND WILLARD KERR 
Psychometric Affiliates Illinois Institute of Technology 


Recent years have seen the extension of Moreno’s sociometric technique (an 
abbreviation of the psychophysical method of rank order) to many types of inter- 
personal situations. In the belief that the feeling of being “unwanted” by 
work associates may be conducive to accidents in industry, the following research 
was conducted. 


EXPERIMENTAL DESIGN 


Subjects. The 90 personnel participating in this study were 44 negro and 46 Spanish- 
speaking (Mexican and Puerto Rican) manual workers. Age range of subjects was 
21-42 among the Spanish-language group and 21-46 among the negro group. 

All subjects had had at least three years of steel mill experience of a manual 
type. Accidents over the last three years were compiled for each worker and these 
ranged from none through four for negro personnel and none through five for Spanish- 
language personnel. 

Setting. These 90 men were all employed on the finishing end of a hot strip in a large 
(16,000 personnel) Chicago-region steel mill. Each of the two racial groups worked 
in nine teams of from four to six men to a team. Each of the 18 teams was a co- 
operative, group-incentive paid working unit. These units included six cranemen- 
hooker teams, six shearing teams, and six cutting teams. 

Procedure. Within his racial group each worker was requested to name the other 
worker whom he “would most like to work with” and also to name the worker whom 
he “would least like to work with.” All men in each racial group knew each other, 
and since each racial group included nine teams, each man had a wide field from which 
to select in making his choices. 
Some men received as many as six favorable nominations; others as few as 
none; and still others as many as seven “dislike” nominations. The formula used in 
combining “like” and “dislike” nominations was: 

Likes² − Dislikes² 
which yielded an interpersonal desirability value for each worker. These obtained 
values made a relatively normal distribution ranging from +36 to -49. Finally, 
they were plotted against the accident records of the same men. 
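
The computation can be illustrated with a short sketch. The squared form of the formula is read from the reported range (six likes give +36, seven dislikes give -49); the function names and the five sample workers are hypothetical and are not data from the study.

```python
import math


def desirability(likes, dislikes):
    """Interpersonal desirability value: likes squared minus dislikes squared."""
    return likes ** 2 - dislikes ** 2


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (likes, dislikes) nominations and three-year accident counts.
nominations = [(6, 0), (3, 1), (1, 1), (0, 4), (0, 7)]
accidents = [0, 1, 1, 3, 5]

values = [desirability(likes, dislikes) for likes, dislikes in nominations]
r = pearson_r(values, accidents)  # negative for this made-up sample
```

Squaring the counts makes a worker with many nominations of one kind stand out far more than a simple difference of likes and dislikes would.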


RESULTS 
Results of this study are displayed in Figure 1, which shows the scatterplot of 
individual cases. The computed Pearsonian product-moment coefficient of correla- 
tion for these data is -.54, which is highly significant statistically. 


FIG. 1. SOCIOMETRIC “DESIRABILITY VALUES” OF EACH OF 90 STEEL WORKERS AND THEIR 
THREE-YEAR ACCIDENT RECORDS (r = -.54). 
INTERPRETATION 


The fact that the least interpersonally desired workers tend to have the most 
accidents immediately suggests two mutually contradictory interpretations. Do 
workers who are involved in accidents tend to become unpopular? Or, do persons 
who are disliked become preoccupied with worry and therefore increasingly liable 
to accidents? 

On the basis of three months of full-time manual experience and observation 
in this steel mill, one of the authors is of the opinion that the typical reaction of 
fellow-workers to another who is involved in an accident is sympathetic rather than 
hostile. There appears to be no reasonable argument to support the first interpreta- 
tion above. However, the second interpretation is not necessarily true either. It 
may be true at least in part—and the authors believe that it is. If it is true that 
interpersonal rejection causes worry which in turn causes accidents, then accidents 
probably can be reduced by the twin devices of (1) sociometric re-grouping of work 
teams, and (2) counseling of “rejected” workers. These speculations are of course 
subject to experimental test. 

A third hypothesis—and one which will be favored by supporters of the con- 
stitutional proneness theory of accidents—is that some workers lack the perceptual 
and/or motor skills to handle either work dexterously or people diplomatically. 

Unfortunately, this limited experimental design does not provide any tests of 
these three hypotheses. It probably is safe to reject outright the accidents-cause- 
ostracism hypothesis, but the rivalry between the second and third hypotheses 
awaits further experimentation for settlement. It is not improbable that both the 
second and third explanations possess some validity. The authors believe that 
evidence eventually will show that more of the variance in accident rates is ac- 
counted for by the second than by the third hypothesis, however. 


SUMMARY 


Nine negro and nine Spanish-speaking teams (90 men) on the finishing end of a 
steel mill hot strip were studied with reference to accident records and interpersonal 
preferences. 

Workers most liked by fellow workers tended to be accident-free. Most of the 
high accident rates were experienced by the workers who were most disliked by their 
associates. Three explanatory hypotheses are discussed. Authors are of the opinion 
that such accidents may be reduced by (1) sociometric grouping, and (2) remedial 
counseling of “disliked” workers. 
COMPENSATION AND THE CRIME OF PIDGEON DROPPING* 
JAMES T. BARBASH 


Eastern State Penitentiary, Philadelphia 


INTRODUCTION 


The extent to which compensatory mechanisms influence the form of anti- 
social behavior exhibited by criminals varies with the individual. Page states, 
“The lack of satisfying human relations, together with its resultant feelings of in- 
adequacy and deprivation, forces the individual to seek substitute satisfaction in 
delinquency.” Carrying this concept a step further, it may be assumed that com- 
pensation, coupled with other factors, also influences the character or form of anti- 
social behavior. The other factors are manifold and include such facets as general 
environmental conditions, intelligence, sex, and educational-vocational success. 

In most cases the extent of compensation as a selective factor is either negligible 
or too deeply interwoven with other forces to be ascertained on a survey basis. This 
does not appear to be the case in a group of confidence men known as Pidgeon Drop- 
pers, Dragmen or Flim Flam artists. It is felt that they chose this form of anti-social 
behavior in an attempt to compensate for feelings of inferiority. The above terms 
are not usually synonymous, but in this case, are interchanged for convenience, and 
are meant to designate a specific clan of confidence men. They should not be con- 
fused with other more numerous varieties of swindlers. 

Pidgeon Dropping is the old pocketbook game, in which two accomplices trick 
money from the victim by offering to share with him the alleged contents of a pur- 
portedly found pocketbook. This is accomplished by demanding a sum of money 
from the dupe to prove that he is a “responsible person” who could pay back the 
“find” should the loser be located before the “legal waiting period of ninety days”. 

Short Con is a term used to describe confidence games which can be executed 
quickly, and often involve the ancient Shell and Pea game, Three Card Monte, or 
a complicated short change routine used on store clerks. 

The Wipe is a hoax in which confederate A states that he is going to take con- 
federate B (Lame Man) to a brothel, and as a precautionary measure requests the 
victim to hold their money. To insure the dupe against carelessness he is required 
to wrap his own money, as well as that of the accomplices, in a handkerchief. This 
in turn, is to be carried inside the shirt. Needless to say, a substitution is made and 
the dupe, believing that he has victimized the others, journeys home, carefully carry- 
ing a handkerchief filled with worthless newspaper. 


METHOD 


This study was conducted at the Eastern State Penitentiary at Philadelphia, 
Pennsylvania. It was necessary to examine the records of 6800 convicts in order to 
locate twenty-five true Flim Flammers, all of whom proved to be members of a racial 
minority. The surprisingly small number of confidence men available, coupled with 
the consistency of the findings, at first fostered the belief that these 25 cases repre- 
sented a fairly adequate sample. However, after communicating with other institu- 
tions, it appears that a larger number of Flim Flam receptions are encountered else- 
where, and with a different ethnic composition of the group. These findings tend 
to limit the significance of the present results. In the current study, numerous cases 
of swindlers were eliminated because they involved other forms of confidence games 
or failed to show consistent Pidgeon Dropper’s patterns. 

The subjects were examined in respect to intelligence, education, personality 
and social history. The tests used were the Scovill Classification Test, the Stanford 
Achievement Test, and the Woodworth Personal Data Record. Subsequently, the 
examinees were interviewed on three separate occasions, two interviews being used 
to gather social material and the third for clinical evaluation and diagnosis. 

*The author is indebted to Dr. Frederick H. Lund, Department of Psychology, Temple Univers- 
ity, for his guidance and suggested interpretations of the survey material. 

The educational achievement test showed a mean grade equivalent of 3.5. 
Three men were considered illiterate. The highest grade reached was the seventh. 
Strangely enough, language usage proved to be no stronger an accomplishment than 
the other subjects tested. 

On psychometric examinations, all 25 men were of lower than average intelli- 
gence. The mean IQ was 62 with a mean mental age of 9 years 3 months. Fourteen 
cases tested as mentally defective, 9 as borderline, and 2 as low average. The 
selective factor of having been apprehended undoubtedly contributes a sampling 
error as in other crime surveys. However, several fundamental factors must be con- 
sidered. First, the intelligence of prison populations compares favorably with non- 
criminals (1, 2, 3, 4, 5). Second, unlike most victims of other forms of felony, the Drag- 
man’s dupe is in contact with him long enough to memorize his characteristics and 
thereby to make possible more positive identification. Finally, IQ’s might have been 
higher if measured by the Wechsler-Bellevue Scale. By the same technique how- 
ever, so too would be the mean of the prison population. 

Results on the Woodworth Personal Data Record showed no consistent pattern. 
Frequently the men appeared emotionally immature but did not express the usual 
neurotic complaints. 

RESULTS 

1. Of the twenty-five cases, twenty-four came from low socio-economic rural 
areas to compete against a more sophisticated culture. 

2. More often than not, the Artist referred to his larcenous behavior with un- 
hidden pride, contemptuously spoke of violence as being beneath his dignity, and 
considered himself professional. It is interesting to note that the term “professional 
criminal” usually implies a rather high degree of mental efficiency. Barnes and 
Teeters“) state: “The professional criminal is alert, usually highly intelligent and 
carries all the earmarks of a business man”. As will be indicated subsequently, such 
is not the case with the Pidgeon Droppers surveyed here. 

3. As a rule, he is known personally, or by reputation, to his fellow artists 
throughout the country. If separated from a partner, for such a reason as police 
action, he moves to the next metropolis, where he seeks out what he calls the “smart 
money set”. Within a short time, he is established as “all right”, and is offered work 
by fellow clansmen. 

4. The Flim Flammer does not make use of pretentious equipment. A hand- 
kerchief, a pocketbook, a newspaper, and occasionally a deck of cards or nut shells 
are his only tools. His choice of victim is made by observing from such vantage 
points as bars, betting establishments, banks and even churches. A likely looking 
gull seen to flash money, or known to have money banked, is “sounded” to deter- 
mine the sufficiency of his financial status and his unfamiliarity with the trick. 

5. His desire for exclusiveness, as well as self-protection, has caused the 
Dragman to gather an intriguing vocabulary. He has borrowed or invented such 
words as boodle, short con, hypie, lame man, stram, wipe, cap man, sound, front 
man and pidgeon dropping. 


6. He persists in routines which rarely change and is forced to live a semi- 
nomadic life. 


7. Rationalization is seen, and compensation implied, when the Flim Flam 
Artist states, “I couldn’t of beat (tricked) him if he hadn’t had larceny in his heart.” 
This statement is made with such frequency as to assume the proportions of a creed. 

8. Persistent nonadaptation as a criterion of intelligence was seen by the high 
number of arrests per man. The mean number of arrests was over twenty-one, with 
the range running from six in seven years, to forty-eight in less than thirty years. 
These figures do not include the number of near arrests each man had. Most of these 
were not for Flim Flam. However, the great quantity must be considered as in- 
dicative of inability to make satisfactory life adjustments. Several of the subjects 
claimed that only three out of ten attempted routines were successful. The other 
seven ended in either arrest or a prodigious amount of talking to prevent incarcera- 
tion. Even this talking, which is usually directed toward a hostile victim, involves 
a set pitch that rarely changes. 

9. Many openly expressed enjoyment of their swindles because it involved 
living by their wits and outsmarting those who had attempted to “beat” them. Most 
of the Flim Flam enticements involve the ‘something for nothing’ motive with un- 
ethical or dishonest gains at the supposed expense of others. 

10. It is interesting to note that class distinction exists even among accomplices. 
Those scoring lowest on the tests most frequently played the “lame man” (pretended 
victim), while those scoring higher considered themselves unsuited for this part. 

DISCUSSION 

Because of sampling errors and other methodological inadequacies, the results 
of this study should be considered as tentative rather than conclusive. However, 
the consistency of the social data indicates that compensatory factors appear to 
play an important role in the personality dynamics of the type of swindler known 
as a Flim Flam Artist. 


SUMMARY AND CONCLUSIONS 


1. Twenty-five convicted Flim Flam Artists were surveyed in an effort to 
evaluate the theory that compensation for feelings of inferiority acts as a selective 
factor in the choice of Flim Flam as an anti-social outlet. 

2. Through testing and interviewing the following facts were established: 

a. All those surveyed belonged to a minority group. 

b. A majority of subjects had come from low socio-economic rural areas 
to compete against a more sophisticated culture. 

c. All preferred to “live by their wits” rather than work or steal by other 
methods for a living. Many of the men openly expressed pleasure in 
having intellectually bested their victims. A preponderant number 
considered themselves “professional”, are clannish, use mystical termin- 
ology and persist in largely unsuccessful, nonadaptive, repetitious lar- 
cenous behavior. 

d. Educationally, the mean grade equivalent for the Pidgeon Droppers in- 
volved was 3.5 grades. Language usage proved to be no stronger than 
the other Stanford Achievement Test areas. 

e. On the basis of intelligence testing, all the subjects exhibited below aver- 
age intelligence. The mean IQ was 62. Fourteen scored as defective, 
nine as borderline, and two as low average. Test results appear con- 
firmed by the frequency of persistent non-adaptive behavior, low level 
of achievement, general immaturity, class distinction among Flim Flam 
accomplices and comparison with other inmates. 

3. The survey findings indicate the probability of compensation as a selective 
factor in the Flim Flam behavior of the cases surveyed. 
THE PARENT AS A RIVAL SIBLING 
RALPH F. BERDIE 


Student Counseling Bureau, University of Minnesota 


Psychological literature frequently refers to the phenomenon of sibling rivalry. 
The complex behavior pattern elicited upon the introduction of a new sibling into 
the home contains many components, but the rivalry aspect is so pronounced that 
it has been approached as the dominating aspect of this behavior. Psychotherapists 
perhaps have demonstrated most clearly the existence of this behavior, but a re- 
latively sophisticated but untrained observer is able to discern rivalry reactions in 
almost every family upon the birth of a younger brother or sister. Pediatricians, 
child psychologists and child psychiatrists have been most concerned with this 
aspect of the child’s behavior, and discussion has usually been directed toward be- 
havior problems directly resulting from the rivalry reaction, usually within a short 
period of time after the birth of the new child. Clinicians also have reported the 
continuation of this sibling rivalry into later childhood, and jealousy reactions to 
siblings no longer infants have been observed, perhaps, as frequently as the reaction 
to the newborn sibling. Relatively little attention has been given, however, to the 
influences of sibling rivalry upon adult behavior and what may be even more im- 
portant, upon the attitudes of parents toward their children. It is odd that this 
particular aspect of parental dynamics has not been given greater attention in light 
of the extensive discussion in recent decades concerning the importance of the 
Oedipus and Electra patterns. 

The purpose of this paper is to suggest that the attitudes of sibling rivalry, 
and other attitudes demonstrated by children toward their siblings, are influential 
in determining the attitudes of parents toward their own children. The younger 
child who experiences great fear and antagonism directed toward his older, more 
aggressive brother, may in turn demonstrate similar attitudes toward his oldest 
son. Similarly, a father who is himself an older son and who, during childhood, 
experienced intense rivalry feelings toward his younger brother might provide undue 
support to his older child at the expense of his younger children, primarily because 
he himself is able to identify with the older child or, perhaps, he identifies the young- 
er child with his own younger sibling. Two rather brief and relatively simple case 
histories might serve to demonstrate these phenomena. 


Case 1. Mr. X was four years old when his younger brother, an only sibling, was 
born. During the first few years in the new child’s life, the older brother showed 
no overt aggressiveness toward the child but was somewhat delayed in terms of bed- 
wetting and chewed his fingernails until the age of 11 or 12. His speech developed 
normally, his reading skills in school and other academic skills developed normally. 
He was somewhat retiring socially but not excessively so and showed no other nervous 
mannerisms or behavior deviations. 

The two brothers did not play much together and spent a great deal of time 
arguing and fighting. As Mr. X looks back upon his childhood, his primary re- 
collection of his attitude toward his younger brother was one of intense irritation. 
As the brothers grew older, however, they developed a warm and friendly relation- 
ship. When Mr. X was married, the pattern in his own family repeated that in his 
father’s family. The birth of his oldest son was followed three and one half years 
later by the birth of his second son. Although Mr. X demonstrated a great deal of 
interest in his oldest child from the time of birth and spent much time with that 
child and gave the child much attention, the birth of the second child did not elicit 
this same response. As the children grew older, Mr. X was unable to demonstrate 
the kind of sympathy with the second child that he was able to show with the first. 
Misbehavior, which he was able to treat rationally with the first child, aroused an 
emotional response when shown by the second child, and Mr. X felt that perhaps 
he would never be able to establish a satisfactory relationship with his youngest son. 

In discussing this problem with a clinician, it became apparent to Mr. X that 
he was demonstrating the same attitude toward his youngest son that he had pre- 
viously demonstrated toward his younger brother. Perhaps years of conditioning 
during childhood had made the arousal of the emotional responses almost automatic, 
and the youngest son’s behavior provided the stimulus. After two one-half hour 
therapeutic sessions, Mr. X expressed excellent insight into his reactions to both his 
older and his younger sons. He was later able to control his own emotional reactions 
to his younger son and gradually develop a strong positive feeling toward that child, 
who in turn responded to this change in parental attitude and became a much more 
lovable child. 


Case 2. Mrs. Y was the younger of two sisters. During childhood, her older sister 
exercised much authority over Mrs. Y. As the children grew older, both sisters 
developed into very attractive women, but Mrs. Y always felt that as a child her 
older sister always had been more attractive than herself and far more skilled socially. 

After Mrs. Y was married she had two daughters about the same difference in 
age as she and her sister were. As the children reached the pre-school and kinder- 
garten age, Mrs. Y found herself showing great favoritism toward the younger child 
and actually resenting much of the normal behavior of the older daughter. Much 
of the oldest child’s behavior was interpreted by Mrs. Y as threatening the younger 
child, and the younger child soon began to pick up this attitude and in turn resent 
her older sister. 

Mrs. Y discussed this problem with a clinician and it became apparent to Mrs. 
Y that she was identifying herself with the youngest daughter, and in turn, identify- 
ing her oldest daughter with her older sister. After recognizing some of the sources 
of the attitudes she had been showing toward her own children, she was able to 
establish separate roles for these children and not regard them as playing the roles of 
herself and of her sister. The favoritism disappeared, and the antagonism toward 
the oldest daughter no longer was so easily aroused. 


SUMMARY 


The behavior described here can be explained through the use of basic learning 
concepts. Perhaps the parent as a child has learned certain reactions to siblings, and 
when the original situations are reasonably accurately repeated, the original be- 
havior pattern tends to reoccur. This is an oversimplification and only one of several 
possible explanations. If the parent has satisfactorily worked out his relationships 
with his siblings, as the two parents described here apparently had, an understanding 
of this phenomenon might be sufficient to alter the attitude of the parent toward the 
child. On the other hand, if the parent’s attitudes toward his sibling never have been 
satisfactorily worked out, an understanding of the sources of his parental attitudes 
may not be followed by a change in his behavior toward the child. 





THE INFLUENCE OF COLOR ON THE CONSISTENCY OF RESPONSES 
IN THE RORSCHACH TEST 
ROBERT M. ALLEN, SIGMUND H. MANNE, AND MARGARET STIFF 
The University of Miami 


INTRODUCTION 


Studies with the Rorschach are concerned mainly with the reliability of the total 
test, and not with the effect of any one determinant on the retest reliability. Bell 
indicates that in a number of studies the conclusions vary so greatly that the question 
of statistical reliability is still unanswered. Mons states: “That the record 
of each individual personality is characteristic can be demonstrated by a series of 
retests after a number of years, or every six to twelve months . . . Experience shapes 
and matures, it develops and restrains, but the fundamental personality remains . . . 
On the average some 20-30% of the responses are identical or similar on second test 
after a year, and at least half of these are unusual ones.” This raises the question 
of the influence of color on the consistency¹ of responses in the Rorschach. 


METHOD AND RESULTS 


The data for this paper are taken from the protocols of 25 students at the 
University of Miami comprising the normal group in a tri-dimensional study. 
The subjects were individually tested and retested alternately with standard or 
chromatic plates (C series) and with a specially prepared set of achromatic (A series) 
Rorschach cards.* The essential difference between C and A series is the absence of 
color in the A series plates 2, 3, 8, 9, and 10, as compared with the standard colored 
plates in C series. The time between test and retest was six weeks. 

The purpose of this paper is to report the influence of color on the consistency 
of responses as evidenced in the protocols of 25 normal college students. The 13 
students in the C group and 12 in A group were tested and retested with the C and 
A plates in AB-BA order. 

Test and retest responses were compared for each plate to determine intertest 
consistency. A response was considered consistent for test-retest if the position of 
the card was the same, the identical area was used, and the same or similar wording 
appeared in both test-retest responses. The mean per cent of consistency was com- 
puted for each card of the A and C series. The final computation included the mean 
per cent of consistency for: (1) the color cards (2, 3, 8, 9, 10) for C and A groups; 
(2) the non-color cards (1, 4, 5, 6, 7) for both groups; and (3) all ten cards in C and 
A groups. 
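
The scoring rule described above can be sketched in code. This is only an illustration under stated assumptions, not the authors' actual procedure: the `Response` record, the `same_or_similar_wording` check (here a plain case-insensitive match), the use of the first-test response count as the denominator, and the sample protocols are all hypothetical, since the paper does not say how similarity of wording was judged or how the per cent was formed.

```python
from dataclasses import dataclass

# Card groupings as given in the text.
COLOR_CARDS = (2, 3, 8, 9, 10)
NON_COLOR_CARDS = (1, 4, 5, 6, 7)
ALL_CARDS = COLOR_CARDS + NON_COLOR_CARDS


@dataclass(frozen=True)
class Response:
    """One response to one plate (hypothetical record layout)."""
    card: int
    position: str  # orientation in which the card was held
    area: str      # blot area used, e.g. "W" or "D1"
    wording: str


def same_or_similar_wording(a: str, b: str) -> bool:
    # Assumption: "same or similar" is reduced to a case-insensitive match.
    return a.strip().lower() == b.strip().lower()


def is_consistent(first: Response, second: Response) -> bool:
    # The three criteria from the text: same card position,
    # identical area, same or similar wording.
    return (
        first.card == second.card
        and first.position == second.position
        and first.area == second.area
        and same_or_similar_wording(first.wording, second.wording)
    )


def percent_consistency(test, retest, cards) -> float:
    """Per cent of first-test responses on `cards` that reappear on retest."""
    first = [r for r in test if r.card in cards]
    if not first:
        return 0.0
    second = [r for r in retest if r.card in cards]
    hits = sum(1 for r in first if any(is_consistent(r, s) for s in second))
    return 100.0 * hits / len(first)


# Hypothetical protocols for a single subject.
test_protocol = [
    Response(1, "upright", "W", "bat"),
    Response(2, "upright", "D1", "two bears"),
    Response(2, "inverted", "D3", "butterfly"),
    Response(8, "upright", "D1", "animal"),
]
retest_protocol = [
    Response(1, "upright", "W", "Bat"),
    Response(2, "upright", "D1", "two bears"),
    Response(2, "upright", "D3", "butterfly"),
    Response(8, "upright", "D1", "lizard"),
]

color_pct = percent_consistency(test_protocol, retest_protocol, COLOR_CARDS)
non_color_pct = percent_consistency(test_protocol, retest_protocol, NON_COLOR_CARDS)
total_pct = percent_consistency(test_protocol, retest_protocol, ALL_CARDS)
```

Averaging such per-subject percentages within the C and A groups would give group means of the kind compared in the results; a different denominator or a looser wording criterion would change the figures.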


1. The mean per cent consistency for the colored cards of C series is 30.4% 
while the same statistic for the same cards in A series is 27%. The significance of 
the difference yields a t of .6, p>.5. The Null Hypothesis remains tenable at this 
level, suggesting that the presence and absence of color in the usually colored cards 
does not affect the consistency of responses for plates 2, 3, 8, 9, 10. 


2. The mean percentage of consistency for the non-color cards of the C series 
is 34.6%, for A series 30.6%. This reduces to a t of .5, p>.5. This also indicates 
that the presence and absence of color in the Rorschach plates does not seem to bias 
the consistency of responses for plates 1, 4, 5, 6, 7. 

¹This term was suggested by Dr. C. H. Sievers and is defined as the reappearance of a response in 
the retest protocol. 

*The A series cards were printed by Verlag Hans Huber under the following instructions: “Use the 
same presses, the same black ink used in printing the standard Rorschach cards, and the same press- 
ure.” 
3. Analysis of the responses to all ten cards in both series results in a mean 
consistency percentage of 32.5% for C and 28.8% for A series. The t of .6 gives a 
p>.5. The inference is the same as above. 


CONCLUSIONS 


1. The consistency of responses for test-retest protocols of the ink blots ranges 
from 27% to 35% for the normal population studied. 


2. Mons’ findings for per cent of consistency are supported. 


3. There is no statistically significant difference in the consistency of re- 
sponses with chromatic and achromatic cards so that the presence and absence of 
color does not seem to have an influence on the degree of consistency of responses 
in the Rorschach Test with a normal population. 
EDITORIAL OPINION 





PASTORAL COUNSELING OR COUNSELING PASTORS? 


One of the most gratifying developments of the past decade has been the widen- 
ing of the scope of the mental hygiene movement to include among its practitioners 
members of all the professions having a legitimate interest in the field. Although the 
medical profession operating through the specialty of psychiatry early assumed a 
position of dominance in opening up and developing such mental hygiene activities 
as institutional care, child guidance, psychiatric clinics for adults, etc., the neighbor- 
ing professions of the ministry, clinical psychology, social work and education were 
not slow to recognize their potentialities for making contributions in the non-medical 
areas of the field. In our opinion, this widening of participation in the field has been 
a very healthy development in the sense that the mental health field is too large to be 
the exclusive responsibility of any profession. There are many aspects of mental 
health work which can be accomplished more optimally by nonmedical personnel 
if for no other reason than to get a wide and representative group of workers interest- 
ed in the field. Thus it was inevitable for pastors interested in helping parishioners 
with personality problems to turn to pastoral psychiatry and psychology in order 
to become proficient in the art of counseling. It was also inevitable that the differ- 
ing ideological backgrounds of medical psychiatry and theology would produce areas 
of conflict and disagreement due to the quite different foundations of the two schools 
of thought. It is therefore not surprising to discover representatives of organized 
religion denouncing some of the theoretical and applied aspects of psychiatry and 
psychology (such as Freudianism). And it was also predictable that organized theol- 
ogy would attempt to construct systems of pastoral counseling consistent with their 
doctrines of the nature of spiritual life. Organized religion is dedicated to the prin- 
ciple that it has something to offer over and above the subject matter of the natural 
sciences. 

This question of whether the spiritual approach possesses any advantages over, 
or can accomplish anything more than, the proven methods of medical and psycho- 
logical science is a crucial one in determining whether we should speak of pastoral 
counseling or of counseling pastors. This question cannot be solved by argument 
or claims however authoritative, but will be settled only when sufficient research 
evidence is collected to determine exactly what is being accomplished by the different 
methods. Fortunately, there is a methodological pattern already established for 
the evaluation of claims and counterclaims made in support of healing methods. 
The medical profession has had a long experience in evaluating therapeutic claims 
and the current issue of this Journal contains a symposium consisting of papers dis- 
cussing the current status of research methods in this area. The questioning attitude 
of the medical or psychological scientist who has learned by bitter experience to be 
skeptical of new and unproven claims will therefore be understood by newcomers to 
the field. Pastoral counseling is such a newcomer entering the field of therapy and 
making claims of successful healing through use of spiritual methods, and it may 
expect that the longer established healing arts will look askance until the new meth- 
ods have been proven valid by scientific experiment. Indeed, rather than feeling 
resentful or defensive about questioning attitudes from medical and psychological 
scientists, pastoral counselors should welcome any method of investigation which 
could demonstrate that they could make unique contributions over and above the 
established methods of psychotherapy. If pastoral counseling does have any unique 
distinctive contribution to make, it is of great importance to have the nature of this 
contribution objectively studied so that it can be swiftly made available to mankind. 

The present status of pastoral counseling is one of considerable ideological and 
theoretical confusion, questionable assumptions and unproven claims. The ideol- 
ogical confusion appears to stem from the apparent need of pastoral counselors to 
resolve inconsistencies between their theological beliefs and the theories of modern 
psychological science. In order not to be inconsistent with theological doctrine, it 
has been necessary to make many assumptions which do not appear to be consistent 
with the available scientific data. Unfortunately, there have been few scientific 
studies objectifying the nature of pastoral counseling (as distinguished from scienti- 
fically oriented therapy) and no one has collected any statistical data concerning 
its efficacy. A survey of papers published in the Journal of Pastoral Care and in 
Pastoral Psychology reveals that most of the material is either purely theoretical or 
empirically based only on anecdotes. It must be concluded that pastoral psychology 
is still in a prescientific stage of its evolution, and that no definite evaluation may be 
made on the basis of existing evidence. One of the best evidences for this contention 
exists in the semantic confusion which currently exists in the field of pastoral psy- 
chology. Semantics is the study of the language of science. In order to deal ob- 
jectively with phenomena, it is necessary to adopt a semantically valid set of words 
or symbols with which to describe the data. The basis of science is to establish ob- 
jective units of measurement. If anything exists, it must exist to some degree, and 
therefore is measurable. The whole field of theology in general, and pastoral counsel- 
ing in particular, reflects a semantic confusion which almost renders any scientific 
study impossible until its basic concepts can be restated in terms which are semantic- 
ally valid and susceptible of experimental and statistical evaluation. And similarly, 
much of the contention and misunderstanding which currently exists between 
theology and psychological science will automatically be dissipated when a common 
ground of semantic validity and scientific evaluation of data is established. At 
present, most of the viewpoints being expressed by all participants must be recog- 
nized as reflecting simply personal opinion and speculation unsupported by scienti- 
fic evidence. This state of affairs is not to be considered as any reflection on the field 
as a whole since it simply reflects the youth of the movement. But serious damage 
will be done if the leaders are too complacent and become satisfied with the status 
quo. A great deal of progress has been made thus far in interesting the ministerial 
profession in problems of mental health and personality counseling, and a good start 
has been made in getting the new movement launched. But the courtship is now 
over, and whether the movement will sink or swim will depend on the degree to 
which it can be gotten on solid foundations. A start must be made on beginning a 
scientific study of pastoral counseling to discover its nature, wherein it differs from 
classical scientific methods, and what its unique contributions are. This can be ac- 
complished if recognized experimental and statistical methods are introduced into 
the field of pastoral counseling. Clergymen themselves will rarely have the scientific 
training to accomplish such a program but it should not be difficult to secure the co- 
operation of competent scientists. 

In the meantime, until any unique contributions can be objectively demon- 
strated, medical experience suggests that it will be wise to adopt a critical and 
questioning attitude. Based on the comparative history of other clinical professions 
and upon our general knowledge of what is and is not possible in psychotherapy, it 
appears that spiritual care and psychological healing are two distinct fields with 
only slight overlap. While psychological methods may help with spiritual problems, 
and vice-versa, it appears that the two areas are distinct and that each has its own 
appropriate methods. It appears safest to assume that clergymen should distinguish 
carefully between their spiritual and psychotherapeutic functions, viewing them- 
selves as counseling pastors rather than pastoral counselors. Pastors are, of course, 
free to utilize any of the standard methods of personality counseling as may be in- 
dicated with any client, but the scientific application of the Law of Parsimony re- 
quires them to be very critical of the interpretations of their results. 

The area in which most of the conflict currently exists concerns the question of 
the role of morality and ethics in personality counseling. Clinical psychology and 
psychiatry have taken the position that moralism has no place in the healing func- 
tion, i.e. that the client must be accepted noncritically and nonjudgmentally for 
what he is. It is believed that any tendency to be moralistic only succeeds in placing 
the client on the defensive and tends to block therapeutic progress. The counseling 
relationship is viewed as one in which healing is the major objective, and in which 
the client must feel perfectly safe. Pastoral counseling, on the other hand, has con- 
cerned itself with the concepts of Sin and Salvation on the assumption that the 
client will be helped if he can be brought into a closer relationship with God. The 
pastoral counselor views himself as a representative of God, entrusted with the duties 
of saving souls and leading men to see the Light. Thus it is inevitable that moralism 
plays a large part in pastoral counseling with the counselor imposing an external 
system of values upon the client in a very directive manner. It is not the purpose of 
the present commentator to pass any judgments upon these different ideological 
viewpoints since there is insufficient evidence to do so. But it is within our province 
to exhort all concerned to embark on research programs to clarify the issues at stake. 
It is only from more extensive knowledge that better work will come. 
F. C. T. 







IN MEMORIAM 





SAMUEL W. HAMILTON, M.D. 
1878 - 1951 


Dr. SAMUEL W. HAMILTON joined the editorial board of this Journal as one of 
its charter members in January 1945. At that time he was one of the busiest and 
most influential psychiatrists in the United States, becoming President of the 
AMERICAN PSYCHIATRIC ASSOCIATION in 1946-1947. In order to understand the 
scope of his interests and influence, it is necessary to review his professional career 
and appointments. Dr. HAMILTON’s major interest was the improvement of insti- 
tution care for psychiatric patients. As mental hospital consultant first for the 
NATIONAL COMMITTEE FOR MENTAL HYGIENE and later for the UNITED STATES 
PUBLIC HEALTH SERVICE, he visited and evaluated almost every large public and pri- 
vate mental institution in the United States and Canada. Through this work he came 
to know and be known by most of the prominent psychiatrists of his time, and thence 
came his influence on the psychiatric scene. Unlike many psychiatrists of his day, 
Dr. HAMILTON always showed a constructive and benevolent interest in the develop- 
ment of clinical psychology as a profession. In his work as mental hospital consult- 
ant, he never lost an opportunity to recommend the addition or expansion of clin- 
ical psychology facilities in mental hospitals. At times when clinical psychology as 
a profession was under attack from some of his colleagues, Dr. HAMILTON stepped 
in with a tolerant word or action to blunt the full force of the attack. During our 
contacts with him as a member of our editorial board, he was always constructive 
and helpful even during moments when the going was rough. He encouraged us in 
our determination to be aggressive and bold in editorial policies, saying: “Don’t try 
to please everybody. You can’t be active and not step on somebody’s toes.” At 
the same time he served as a tempering influence because of his essentially conserva- 
tive approach to life. Clinical psychology has lost an able supporter. The Journal 
of Clinical Psychology has lost a staunch colleague. We have lost a very good per- 
sonal friend. 


F. C. T. 
SHNEIDMAN, E. S. Thematic Test Analysis. New York: Grune & Stratton, 1951, pp. 
320. 

This book presents the results of an interesting study in which fifteen clinical 
psychologists contributed blind interpretations of test results of a single subject on 
the Thematic Apperception Test and the Make A Picture Story Test. The collected 
interpretations are compared with additional test and behavioral data, and also sub- 
jected to preliminary statistical work-up. The results are significant in that they 
make possible a comparative study of the work of various “experts”. 


HARROWER, M. R. and STEINER, M. E. Large Scale Rorschach Techniques. Spring- 
field, Ill.: C. C. Thomas, 1951, pp. 353. $8.50. 


This is a revised second edition of the well known manual for the administration 
of Group Rorschach and Multiple Choice Tests. 


BACKUS, OLLIE and BEASLEY, JANE. Speech Therapy with Children. Boston: Hough- 
ton Mifflin, 1951, pp. 441. $3.00. 


This is one of the first dynamically oriented texts on speech therapy with child- 
ren. The authors are staff members of the University of Alabama. Abandoning the 
older orientation which overemphasized the mechanics of speech and correctional 
devices, this book stresses the importance of the therapeutic relationship, i.e. to pro- 
vide corrective emotional experiences in a non-segregated group situation. Their 
methods are reproduced verbatim with a great deal of illustrative materials. This 
book will make a very definite contribution to the field. 


SECHEHAYE, M. A. Symbolic Realization. New York: International Universities 
Press, 1951, pp. 184. $3.25. 

In the method of “symbolic representation”, the therapist establishes a relation- 
ship with the client using symbolic signs and gestures in order to secure representa- 
tion of conflicts which had arisen on preverbal levels of development and which are 
therefore not accessible by the orthodox verbal method of psychoanalysis. This 
monograph consists of the detailed presentation of a case which had been diagnosed 
as hopelessly schizophrenic and which had been inaccessible to standard methods. 
By symbolic representation of conflicts and needs, the therapist appears to have 
established a successful therapeutic relationship when all else had failed. 


KLEIN, D. B. Abnormal Psychology. New York: Henry Holt, 1951, pp. 589. $4.75. 
The author is Director of the Psychological Service Center and Professor of 
Psychology at the University of Southern California. The publication of a new text 
in the field of abnormal psychology is always an event which stimulates the hope that 
its author will succeed in achieving a new orientation of theories and facts which will 
reduce the confusion and omissions which have existed in the past. This book suc- 
ceeds in encompassing practically all of the current theories and facts of clinical 
psychology, but their manner of presentation in spots leaves much to be desired. 
Thus there are discussions of the more complex methods of therapy such as psycho- 
surgery under such paragraph headings as “What price lobotomy?” and “Can psy- 
chiatrists agree?”. Much of the material is presented in terms of the author’s opin- 
ions rather than in terms of actual research results. Presumably this book is intended 
for undergraduate and graduate students of abnormal psychology, in which case 
the students will probably become terribly confused by technical discussions of the 
fine points of psychopathology, psychodiagnostics and psychotherapy which are still 
under discussion by the experts. Although valuable, this does not seem to be the 
book to end all books on abnormal psychology. 


WINTER, J. A. A Doctor’s Report on Dianetics. New York: Julian Press, 1951, pp. 
221. $3.50. 

In spite of its shaky scientific foundations, dianetics has attracted the interest 
of a considerable number of physicians and other scientifically trained professional 
men. The author is a practising physician who became interested in dianetics and 
became the first Medical Director of the Hubbard Dianetic Research Foundation. 
It appears that Dr. Winter became very critical of the program of the foundation 
but not to the degree that he abandoned the method. Instead he resigned from the 
position and wrote a book presenting his own interpretation of the mechanisms of 
dianetics. He presents his own opinions of the methods together with examples of 
the applications of dianetics. There may be some unique contributions of the dian- 
etic method but this book fails to reveal what they are. Instead, the work has the 
aroma of quackery. 


BRACHFELD, OLIVER. Inferiority Feelings. New York: Grune & Stratton, 1951, 
pp. 301. 


The author attempts to present all available information on the historical de- 
velopment and contemporary status of the concept of the inferiority complex. He 
has succeeded in doing a competent job of library research. 

ISCHLONDSKY, N. E. Brain and Behavior. St. Louis: Mosby, 1949, pp. 182. $7.00. 

This monograph presents the author’s neurophysiological research on the 
phenomena of induction as a fundamental mechanism of neuro-psychic activity. 
Some of the experiments cited would make good exercises for a course in advanced 
experimental psychology. 


GOHEEN, H. W. and KAVRUCK, S. Selected References on Test Construction, Mental 
Test Theory, and Statistics, 1929-1949. Washington: U. S. Government Printing 
Office, 1950, pp. 209. $1.50. 


This monograph includes 2544 selected references on all phases of test theory 
and construction and should be invaluable for research workers. 


DIMICHAEL, S. G. (Ed.). Vocational Rehabilitation of the Mentally Retarded. Wash- 
ington: U. S. Government Printing Office, 1950, pp. 184. $.45. 


A distinguished group of authors contribute chapters on various aspects of the 
vocational rehabilitation of the mentally defective. This monograph is intended as 
an orientation manual for vocational counselors. It contains much material for the 
student and beginning worker. 


WOLFF, Werner. Values and Personality. New York: Grune & Stratton, 1950, 
pp. 239. $4.75. 


This book is subtitled “An Existential Psychology of Crisis,” referring to the 
tendency of individuals to become embroiled in the world crises of their times. 
Following the new philosophical school of “existentialism,” Wolff attempts to describe 
and classify psychological phenomena according to their existential significance. 
It is hypothesized that man develops existential conflicts on becoming aware of his 
freedom of decision and the resulting responsibility. Each person shows a pattern of 
attitudes and reactions which reveals his existential design, and an existential target 
or ultimate goal which determines his actions. It is therefore important to study 
the existential reality which consists of values rather than facts, and which is the real 
substance of psychology. 

GINDES, Bernard C. New Concepts of Hypnosis. New York: Julian Press, 1951, 
pp. 262. $4.00. 


The author is a physician who presents rather elementary discussions of the use 
of hypnosis as an adjunct to psychotherapy and medicine. 


SAUL, Leon J. Bases of Human Behavior. Philadelphia: Lippincott, 1951, pp. 150. 


A series of general essays by a Professor of Clinical Psychiatry at the University 
of Pennsylvania School of Medicine. 


SCHILDER, Paul. Brain and Personality. New York: International Universities 
Press, 1951, pp. 136. $2.50. Second printing. 


MYKLEBUST, Helmer R. Your Deaf Child. Springfield, Ill.: C. C. Thomas, 1950, 
pp. 133. $2.50. A Guide for Parents. 


KASIUS, Cora (Ed.). A Comparison of Diagnostic and Functional Casework Concepts. 
New York: Family Service Association of America, 1950, pp. 169. 

This monograph consists of a report made by the Committee to Study Basic 
Concepts in Casework Practice of the Family Service Association of America. It 
consists of theoretical presentations of diagnostic and functional viewpoints in case 
work following the viewpoints of Freud and Rank. Detailed presentations are in- 
cluded of a case treated by each method. 


SCHAER, Hans. Religion and the Cure of Souls in Jung’s Psychology. New York: 
Pantheon Books, 1950, pp. 221. $3.50. 


This is another of the Bollingen Series written by a Protestant theologian who 
finds Jungian psychology of value in explaining how religion can provide a way to 
spiritual healing and health. It consists of a series of essays relating Jungian psy- 
chology to the situation of religion today. 


BAUMAN, Mary K. and HAYES, Samuel P. A Manual for the Psychological Examination 
of the Blind. New York: Psychological Corporation, 1940, pp. 58. 


BRUN, Rudolf. General Theory of Neuroses. New York: International Universities 
Press, 1951, pp. 469. $10.00. 
Dr. Brun is Professor of Neurology and Neurobiology at the University of Zurich. 
The book consists of twenty-two essays or lectures on the nature of the neuroses, 
psychosomatic relationships, the theory of instincts, and mechanisms of symptom formation. 


Swami, AKHILANANDA. Mental Health and Hindu Psychology. New York: Harper, 
1951, pp. 231. $3.50. 
Hindu psychology (or rather philosophy) is concerned with the actualization of 
the inner possibilities of man through ennobling religious experiences. This book 
consists of a series of essays purporting to show how mental power can be attained 
through Hindu religious practices. 


MOORE, Merrill. Clinical Sonnets. New York: Twayne Publishers, 1949, pp. 72. 
$2.50. 


A leading psychiatrist presents effective pictures, in verse, of persons and their 
reactions to life. 

Invaluable tools for 
pe and teachers 


Thematic Apperception Test 


By Henry Alexander Murray and the Staff of the 
Harvard Psychological Clinic 


REVISED EDITION 


A method of revealing to the trained interpreter the dominant 
drives, emotions, sentiments, complexes, conflicts and hidden 
inhibited tendencies of a personality. 30 pictures on 9” x 11” cards 
together with a manual on the administration of the test and on 
the analysis and interpretation of the results. 


“An important advance in testing emotional and subconscious 
elements.” -- The Nervous Child 
Test, including Manual $5.00 


Thematic Apperception Test 


THOMPSON MODIFICATION 
By Charles E. Thompson 


Use of the Thematic Apperception Test with members of the 
Negro culture group and with certain white patients reveals 
that some of the test pictures are differently interpreted or have 
no significance for them. Dr. Thompson has supplied new pictures 
for 21 of Dr. Murray’s originals, and a new manual. 


Test, including manual, $6.00 


Order From 
HARVARD UNIVERSITY PRESS 


44 FRANCIS AVENUE CAMBRIDGE 38, MASS. 












CLINICAL 
STAFF 


MEDICAL STAFF 
OF PENNSYLVANIA 


Leslie R. Angus, M.D. 
Robert Devereux, M.D. 
Ruth E. Duffy, M.D. 
Herbert H. Herskovitz, M.D. 
J. Clifford Scott, M.D. 
Calvin F. Settlage, M.D. 
Ruth Stephenson, M.D. 


PSYCHOLOGICAL STAFF 
OF PENNSYLVANIA 


Edgar A. Doll, Ph.D. 
Director of Research 


Milton Brutten, Ph.D. 
Michael B. Dunn, A.M. 
Robert G. Ferguson, A.M. 
Edward L. French, Ph.D. 
John R. Kleiser, A.M. 
Mary J. Pawling, A.M. 
M. Eleanor Ross, Litt.M. 


PROFESSIONAL STAFF, 
THE DEVEREUX RANCH 
SCHOOL, CALIFORNIA 


Charles M. Campbell, Jr., M.D. 


Consulting Pediatrician 


Richard H. Lambert, M.D. 
Consulting Psychiatrist 


Ivan A. McGuire, M.D. 
Consulting Psychiatrist 


David L. Reeves, M.D. 
Consulting Neurologist 
Robert L. Brigden, Ph.D. 
Director of Ranch School 


Thomas W. Jefferson, Ph.D. 
Clinical Psychologist 





HAPPINESS 
through Education 
with Therapy 


A HAPPY LIFE is still possible for the child who is failing socially and 
scholastically -- not because of limited intelligence, but from sheer inability 
to cope with his emotional environment. With individualized guidance and 
thoughtfully directed activities, the disturbed child can be relieved of his 
anxieties and be assisted with his emotional development. He can still become 
a happy, well-integrated child. 


When, in your practice, you feel that a 
school-age patient needs specialized educa- 
tion and guidance, we invite you to let us 
evaluate the potential outcome. When the 
intelligence is normal, but emotional dis- 
turbances block the ability to learn, our 
experienced staff will carefully study the 
details and offer a considered report. 